Enhancing Vision-Language Model with Unmasked Token Alignment
Abstract
Contrastive pre-training on image-text pairs, exemplified by CLIP, has become a standard technique for learning multi-modal vision-language representations. Although CLIP has demonstrated remarkable performance, training it from scratch on noisy web-scale datasets is computationally demanding. On the other hand, mask-then-predict pre-training approaches, like Masked Image Modeling (MIM), offer efficient self-supervised learning for single-modal representations. This paper introduces $\textbf{U}$nmasked $\textbf{T}$oken $\textbf{A}$lignment ($\textbf{UTA}$), a method that leverages existing CLIP models to further enhance their vision-language representations. UTA trains a Vision Transformer (ViT) by aligning unmasked visual tokens to the corresponding image tokens from a frozen CLIP vision encoder, which automatically aligns the ViT model with the CLIP text encoder. The pre-trained ViT can be directly applied for zero-shot evaluation even without training on image-text pairs. Compared to MIM approaches, UTA does not suffer from training-finetuning inconsistency and is much more training-efficient because it avoids using the extra $\mathrm{[MASK]}$ tokens. Extensive experimental results demonstrate that UTA can enhance CLIP models and outperform existing MIM methods on various uni- and multi-modal benchmarks.
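As a rough illustration of the alignment objective summarized above, the sketch below aligns a student ViT's unmasked patch tokens with the corresponding tokens from a frozen CLIP vision encoder. The function names, the `keep_ids` interface, the random mask-sampling scheme, and the cosine-similarity loss are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def uta_alignment_loss(student_vit, frozen_clip_visual, images, mask_ratio=0.5):
    """Minimal sketch of Unmasked Token Alignment (hypothetical interfaces).

    Assumes `student_vit(images, keep_ids=...)` and `frozen_clip_visual(images)`
    return per-patch token features of shape (B, N, D); the cosine loss is an
    assumed choice for the alignment objective.
    """
    B = images.shape[0]
    N = student_vit.num_patches  # hypothetical attribute: number of patch tokens

    # Randomly keep a subset of patches; only unmasked tokens are aligned,
    # so no extra [MASK] tokens are ever fed to the student ViT.
    num_keep = int(N * (1 - mask_ratio))
    keep_ids = torch.rand(B, N, device=images.device).argsort(dim=1)[:, :num_keep]

    # Student encodes only the unmasked patches.
    student_tokens = student_vit(images, keep_ids=keep_ids)  # (B, num_keep, D)

    # Frozen CLIP vision encoder provides the alignment targets.
    with torch.no_grad():
        teacher_tokens = frozen_clip_visual(images)  # (B, N, D)
    teacher_kept = torch.gather(
        teacher_tokens, 1,
        keep_ids.unsqueeze(-1).expand(-1, -1, teacher_tokens.size(-1)))

    # Align each unmasked student token with the corresponding frozen CLIP token.
    return 1 - F.cosine_similarity(student_tokens, teacher_kept, dim=-1).mean()
```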
Cite
Text
Liu et al. "Enhancing Vision-Language Model with Unmasked Token Alignment." Transactions on Machine Learning Research, 2024.
Markdown
[Liu et al. "Enhancing Vision-Language Model with Unmasked Token Alignment." Transactions on Machine Learning Research, 2024.](https://mlanthology.org/tmlr/2024/liu2024tmlr-enhancing/)
BibTeX
@article{liu2024tmlr-enhancing,
title = {{Enhancing Vision-Language Model with Unmasked Token Alignment}},
author = {Liu, Jihao and Zheng, Jinliang and Liu, Boxiao and Liu, Yu and Li, Hongsheng},
journal = {Transactions on Machine Learning Research},
year = {2024},
url = {https://mlanthology.org/tmlr/2024/liu2024tmlr-enhancing/}
}