Efficient Multimodal Alignment: To Freeze or Not to Freeze?

Abstract

Language-image pretraining creates a joint representation space for the two modalities in which images and texts with similar semantic content lie close to each other. Language-image models are often trained from scratch without taking advantage of pretrained unimodal models. By aligning the representation spaces of two modality-specific encoders, our model achieves 74.7% accuracy on the ImageNet1K validation set at a training cost two orders of magnitude lower. In this work, we highlight the importance of unfreezing the CLS tokens of unimodal transformer encoders to create a joint embedding space. Freezing the image and text CLS tokens reduces the mean accuracy from 37.5% to 19.4% across the 38 evaluation benchmarks.
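
The abstract describes aligning two frozen, pretrained unimodal encoders while leaving their CLS tokens (plus lightweight projections) trainable. The following is a minimal sketch of that setup under a CLIP-style contrastive objective; it assumes PyTorch and two pretrained encoders that expose a `cls_token` parameter, a `hidden_dim` attribute, and return CLS embeddings from their forward pass. These names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AlignedDualEncoder(nn.Module):
    """Freeze two pretrained unimodal encoders; keep only CLS tokens
    and small projection heads trainable (illustrative sketch)."""

    def __init__(self, image_encoder, text_encoder, embed_dim=512):
        super().__init__()
        self.image_encoder = image_encoder  # e.g. a pretrained ViT (assumed interface)
        self.text_encoder = text_encoder    # e.g. a pretrained text transformer (assumed interface)

        # Freeze all pretrained weights.
        for p in self.image_encoder.parameters():
            p.requires_grad = False
        for p in self.text_encoder.parameters():
            p.requires_grad = False

        # Unfreeze only the CLS tokens (attribute names are assumptions).
        self.image_encoder.cls_token.requires_grad = True
        self.text_encoder.cls_token.requires_grad = True

        # Trainable projections into the joint embedding space.
        self.image_proj = nn.Linear(self.image_encoder.hidden_dim, embed_dim)
        self.text_proj = nn.Linear(self.text_encoder.hidden_dim, embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~ln(1/0.07), CLIP-style init

    def forward(self, images, text_tokens):
        img_cls = self.image_encoder(images)      # assumed to return the CLS embedding
        txt_cls = self.text_encoder(text_tokens)  # assumed to return the CLS embedding
        img_emb = F.normalize(self.image_proj(img_cls), dim=-1)
        txt_emb = F.normalize(self.text_proj(txt_cls), dim=-1)
        return img_emb, txt_emb, self.logit_scale.exp()


def contrastive_loss(img_emb, txt_emb, scale):
    # Symmetric InfoNCE over matching image-text pairs in the batch.
    logits = scale * img_emb @ txt_emb.t()
    targets = torch.arange(len(logits), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

In this sketch, the reported ablation corresponds to toggling the two `requires_grad = True` lines on the CLS tokens: leaving them frozen removes the only trainable parameters inside the encoders themselves, leaving just the projection heads to bridge the modalities.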

Cite

Text

Aczel and Wattenhofer. "Efficient Multimodal Alignment: To Freeze or Not to Freeze?" NeurIPS 2023 Workshops: UniReps, 2023.

Markdown

[Aczel and Wattenhofer. "Efficient Multimodal Alignment: To Freeze or Not to Freeze?" NeurIPS 2023 Workshops: UniReps, 2023.](https://mlanthology.org/neuripsw/2023/aczel2023neuripsw-efficient/)

BibTeX

@inproceedings{aczel2023neuripsw-efficient,
  title     = {{Efficient Multimodal Alignment: To Freeze or Not to Freeze?}},
  author    = {Aczel, Till and Wattenhofer, Roger},
  booktitle = {NeurIPS 2023 Workshops: UniReps},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/aczel2023neuripsw-efficient/}
}