PiLaMIM: Toward Richer Visual Representations by Integrating Pixel and Latent Masked Image Modeling
Abstract
In Masked Image Modeling (MIM), two primary methods exist: Pixel MIM and Latent MIM, each utilizing different reconstruction targets, raw pixels and latent representations, respectively. Pixel MIM tends to capture low-level visual details such as color and texture, while Latent MIM focuses on high-level semantics of an object. However, these distinct strengths of each method can lead to suboptimal performance in tasks that rely on a particular level of visual features. To address this limitation, we propose PiLaMIM, a unified framework that combines Pixel MIM and Latent MIM to integrate their complementary strengths. Our method uses a single encoder along with two distinct decoders: one for predicting pixel values and another for latent representations, ensuring the capture of both high-level and low-level visual features. We further integrate the $\texttt{[CLS]}$ token into the reconstruction process to aggregate global context, enabling the model to capture more semantic information. Extensive experiments demonstrate that PiLaMIM outperforms key baselines such as MAE, I-JEPA and BootMAE in most cases, proving its effectiveness in extracting richer visual representations.
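The two-decoder objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the loss functions, the loss weighting, or how latent targets are produced, so the code below assumes mean squared error for both targets, a hypothetical `latent_weight` coefficient, and latent targets supplied by some unspecified target encoder. The $\texttt{[CLS]}$ token is omitted. All function names are illustrative.

```python
import random


def mse(pred, target):
    """Mean squared error over two equal-length lists of vectors."""
    total, count = 0.0, 0
    for p_vec, t_vec in zip(pred, target):
        for p, t in zip(p_vec, t_vec):
            total += (p - t) ** 2
            count += 1
    return total / count


def random_mask(num_patches, mask_ratio, rng):
    """Return sorted indices of the patches hidden from the encoder."""
    num_masked = int(num_patches * mask_ratio)
    return sorted(rng.sample(range(num_patches), num_masked))


def combined_loss(pixel_pred, pixel_target, latent_pred, latent_target,
                  masked_idx, latent_weight=1.0):
    """Combine pixel and latent reconstruction losses on masked patches.

    Assumption: both losses are MSE and are summed with a scalar weight;
    the source abstract does not specify either choice.
    """
    def pick(seq):
        # Keep only the entries at masked patch positions.
        return [seq[i] for i in masked_idx]

    # Output of the pixel decoder vs. raw pixel patches (low-level target).
    pixel_loss = mse(pick(pixel_pred), pick(pixel_target))
    # Output of the latent decoder vs. target latents (high-level target).
    latent_loss = mse(pick(latent_pred), pick(latent_target))
    total = pixel_loss + latent_weight * latent_loss
    return total, pixel_loss, latent_loss


if __name__ == "__main__":
    rng = random.Random(0)
    masked = random_mask(num_patches=4, mask_ratio=0.5, rng=rng)
    # Toy per-patch vectors standing in for decoder outputs and targets.
    pixel_pred = [[0.0, 0.0]] * 4
    pixel_target = [[1.0, 1.0]] * 4
    latent_pred = [[0.0]] * 4
    latent_target = [[2.0]] * 4
    print(combined_loss(pixel_pred, pixel_target,
                        latent_pred, latent_target, masked))
```

In this sketch both decoders would read the same encoder output, so gradients from the pixel and latent terms reach one shared encoder; that shared path is the mechanism the abstract credits for capturing both low-level and high-level features.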
Cite
Text
Lee et al. "PiLaMIM: Toward Richer Visual Representations by Integrating Pixel and Latent Masked Image Modeling." NeurIPS 2024 Workshops: SSL, 2024.
Markdown
[Lee et al. "PiLaMIM: Toward Richer Visual Representations by Integrating Pixel and Latent Masked Image Modeling." NeurIPS 2024 Workshops: SSL, 2024.](https://mlanthology.org/neuripsw/2024/lee2024neuripsw-pilamim/)
BibTeX
@inproceedings{lee2024neuripsw-pilamim,
title = {{PiLaMIM: Toward Richer Visual Representations by Integrating Pixel and Latent Masked Image Modeling}},
author = {Lee, Junmyeong and Hwang, Eui Jun and Cho, Sukmin and Park, Jong C.},
booktitle = {NeurIPS 2024 Workshops: SSL},
year = {2024},
url = {https://mlanthology.org/neuripsw/2024/lee2024neuripsw-pilamim/}
}