CAE V2: Context Autoencoder with CLIP Latent Alignment

Xinyu Zhang, Jiahui Chen, Junkun Yuan, Qiang Chen, Jian Wang, Xiaodi Wang, Shumin Han, Xiaokang Chen, Jimin Pi, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang

TMLR 2023

Abstract

Masked image modeling (MIM) learns visual representations by predicting masked patches against a pre-defined target. Inspired by MVP (Wei et al., 2022b), which shows impressive gains with CLIP, we also employ the semantically rich CLIP latent as the target and further tap its potential by introducing a new MIM pipeline, CAE v2, to learn a high-quality encoder and facilitate model convergence on the pre-training task. CAE v2 is an improved variant of CAE (Chen et al., 2023) that applies the CLIP latent to two pre-training tasks: visible latent alignment and masked latent alignment. Visible latent alignment directly aligns the encoder's latent representations of visible patches with the corresponding CLIP latent, which facilitates model convergence and improves the representational ability of the encoder. Masked latent alignment predicts the representations of masked patches within the feature space of the CLIP latent, as a standard MIM task does, effectively aligning the representations computed by the encoder and the regressor into the same domain. We pre-train CAE v2 on ImageNet-1K images and evaluate it on various downstream vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation. Experiments show that CAE v2 achieves competitive performance and even outperforms the CLIP vision encoder, demonstrating the effectiveness of our method. Code is available at https://github.com/Atten4Vis/CAE.
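
The two alignment tasks described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the cosine-distance loss, the function names (`cosine_distance`, `alignment_losses`), and the array shapes are assumptions made for illustration, and the random arrays stand in for real encoder, regressor, and CLIP outputs. For the actual method, see the linked repository.

```python
import numpy as np


def cosine_distance(a, b, eps=1e-8):
    """Mean (1 - cosine similarity) between matching rows of a and b.

    Assumption: the abstract does not name the loss; cosine distance is
    used here only as a plausible choice for illustration.
    """
    a_n = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(a_n * b_n, axis=-1)))


def alignment_losses(visible_latent, masked_pred, clip_latent, mask):
    """Compute the two alignment terms for one image (hypothetical helper).

    visible_latent: (num_visible, dim) encoder outputs for visible patches
    masked_pred:    (num_masked, dim) predicted latents for masked patches
    clip_latent:    (num_patches, dim) CLIP latents for every patch
    mask:           (num_patches,) bool array, True where a patch is masked
    """
    # Visible latent alignment: encoder outputs vs. CLIP latents of visible patches.
    loss_visible = cosine_distance(visible_latent, clip_latent[~mask])
    # Masked latent alignment: predictions vs. CLIP latents of masked patches.
    loss_masked = cosine_distance(masked_pred, clip_latent[mask])
    return loss_visible, loss_masked


# Toy usage with random stand-in features (8 patches, 16-dim latents).
rng = np.random.default_rng(0)
clip_latent = rng.normal(size=(8, 16))
mask = np.array([True, False, True, True, False, False, True, False])
visible_latent = rng.normal(size=(int((~mask).sum()), 16))
masked_pred = rng.normal(size=(int(mask.sum()), 16))
loss_visible, loss_masked = alignment_losses(
    visible_latent, masked_pred, clip_latent, mask
)
total_loss = loss_visible + loss_masked  # equal weighting is an assumption
```

Both terms share the same CLIP target, which is what places the encoder outputs and the masked-patch predictions in one feature space.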

Cite

Text

Zhang et al. "CAE V2: Context Autoencoder with CLIP Latent Alignment." Transactions on Machine Learning Research, 2023.

Markdown

[Zhang et al. "CAE V2: Context Autoencoder with CLIP Latent Alignment." Transactions on Machine Learning Research, 2023.](https://mlanthology.org/tmlr/2023/zhang2023tmlr-cae/)

BibTeX

@article{zhang2023tmlr-cae,
  title     = {{CAE V2: Context Autoencoder with CLIP Latent Alignment}},
  author    = {Zhang, Xinyu and Chen, Jiahui and Yuan, Junkun and Chen, Qiang and Wang, Jian and Wang, Xiaodi and Han, Shumin and Chen, Xiaokang and Pi, Jimin and Yao, Kun and Han, Junyu and Ding, Errui and Wang, Jingdong},
  journal   = {Transactions on Machine Learning Research},
  year      = {2023},
  url       = {https://mlanthology.org/tmlr/2023/zhang2023tmlr-cae/}
}