CAE V2: Context Autoencoder with CLIP Latent Alignment
Abstract
Masked image modeling (MIM) learns visual representations by predicting masked patches against a pre-defined target. Inspired by MVP (Wei et al., 2022b), which shows impressive gains with CLIP, we also adopt the semantically rich CLIP latent as the target and further tap its potential by introducing a new MIM pipeline, CAE v2, to learn a high-quality encoder and facilitate model convergence on the pre-training task. CAE v2 is an improved variant of CAE (Chen et al., 2023) that applies the CLIP latent to two pre-training tasks: visible latent alignment and masked latent alignment. Visible latent alignment trains the encoder's latent representations of visible patches to mimic the corresponding CLIP latents, which facilitates model convergence and improves the representational ability of the encoder. Masked latent alignment predicts the representations of masked patches within the feature space of the CLIP latent, as a standard MIM task does, effectively aligning the representations computed by the encoder and the regressor into the same domain. We pre-train CAE v2 on ImageNet-1K images and evaluate it on various downstream vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation. Experiments show that CAE v2 achieves competitive performance and even outperforms the CLIP vision encoder, demonstrating the effectiveness of our method. Code is available at https://github.com/Atten4Vis/CAE.
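The two alignment objectives described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `cosine_distance` and `alignment_losses`, the dictionary-based data layout, and the choice of cosine distance as the loss are all assumptions made for the example, since the abstract does not specify the loss function. CLIP latents are treated as fixed targets.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def alignment_losses(encoder_visible, regressor_masked, clip_targets, mask):
    """Compute (visible_loss, masked_loss) for one image.

    encoder_visible:  dict mapping visible patch index -> encoder latent
    regressor_masked: dict mapping masked patch index -> predicted latent
    clip_targets:     list of per-patch CLIP latents (fixed targets)
    mask:             list of bools, True where the patch is masked

    Assumes at least one visible and one masked patch.
    The loss choice (cosine distance) is an assumption of this sketch.
    """
    visible_idx = [i for i, masked in enumerate(mask) if not masked]
    masked_idx = [i for i, masked in enumerate(mask) if masked]

    # Visible latent alignment: encoder outputs vs. CLIP latents.
    visible_loss = sum(
        cosine_distance(encoder_visible[i], clip_targets[i]) for i in visible_idx
    ) / len(visible_idx)

    # Masked latent alignment: predictions for masked patches vs. CLIP latents.
    masked_loss = sum(
        cosine_distance(regressor_masked[i], clip_targets[i]) for i in masked_idx
    ) / len(masked_idx)

    return visible_loss, masked_loss


# Toy example with three patches of dimension 2; patch 1 is masked.
clip_targets = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mask = [False, True, False]
encoder_visible = {0: [2.0, 0.0], 2: [1.0, 1.0]}
regressor_masked = {1: [1.0, 0.0]}

visible_loss, masked_loss = alignment_losses(
    encoder_visible, regressor_masked, clip_targets, mask
)
```

In this toy case the visible latents point in the same direction as their CLIP targets, so the visible loss is approximately zero, while the masked prediction is orthogonal to its target. In training, the two terms would typically be combined into one objective; how they are weighted is not stated in the abstract.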
Cite
Text
Zhang et al. "CAE V2: Context Autoencoder with CLIP Latent Alignment." Transactions on Machine Learning Research, 2023.
Markdown
[Zhang et al. "CAE V2: Context Autoencoder with CLIP Latent Alignment." Transactions on Machine Learning Research, 2023.](https://mlanthology.org/tmlr/2023/zhang2023tmlr-cae/)
BibTeX
@article{zhang2023tmlr-cae,
title = {{CAE V2: Context Autoencoder with CLIP Latent Alignment}},
author = {Zhang, Xinyu and Chen, Jiahui and Yuan, Junkun and Chen, Qiang and Wang, Jian and Wang, Xiaodi and Han, Shumin and Chen, Xiaokang and Pi, Jimin and Yao, Kun and Han, Junyu and Ding, Errui and Wang, Jingdong},
journal = {Transactions on Machine Learning Research},
year = {2023},
url = {https://mlanthology.org/tmlr/2023/zhang2023tmlr-cae/}
}