Efficient Multimodal Alignment: To Freeze or Not to Freeze?
Abstract
Language-image pretraining creates a joint representation space between the two modalities in which images and texts with similar semantic content lie close to each other. Language-image models are often trained from scratch without taking advantage of unimodal pretrained models. By aligning the representation spaces of two modality-specific encoders, our model achieves 74.7% accuracy on the ImageNet-1K validation set at two orders of magnitude lower training cost. In this work, we highlight the importance of unfreezing the CLS tokens of the unimodal transformer encoders when creating a joint embedding space. Freezing the image and text CLS tokens reduces the mean accuracy from 37.5% to 19.4% across the 38 evaluation benchmarks.
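The sketch below illustrates the core idea described in the abstract: keep two pretrained unimodal transformer encoders frozen except for their CLS tokens (plus small projection heads), and align their outputs. It is not the authors' code; the CLIP-style contrastive loss, the projection heads, the tiny stand-in encoders, and helper names such as freeze_except_cls are assumptions made for illustration only.

# Minimal sketch (assumed setup, not the paper's implementation): align two
# unimodal encoders with a CLIP-style contrastive loss while keeping only the
# CLS tokens and small projection heads trainable.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyTransformerEncoder(nn.Module):
    """Stand-in for a pretrained unimodal transformer with a learnable CLS token."""

    def __init__(self, dim=256, depth=2, n_heads=4):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, tokens):                       # tokens: (B, N, dim)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        out = self.blocks(torch.cat([cls, tokens], dim=1))
        return out[:, 0]                              # CLS embedding


def freeze_except_cls(encoder: nn.Module):
    """Freeze all encoder weights; keep only the CLS token trainable."""
    for name, p in encoder.named_parameters():
        p.requires_grad = ("cls_token" in name)


def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over an in-batch similarity matrix."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(img.size(0))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2


if __name__ == "__main__":
    image_enc, text_enc = TinyTransformerEncoder(), TinyTransformerEncoder()
    freeze_except_cls(image_enc)
    freeze_except_cls(text_enc)
    img_proj, txt_proj = nn.Linear(256, 128), nn.Linear(256, 128)

    # Only CLS tokens and projection heads receive gradients.
    params = [p for m in (image_enc, text_enc, img_proj, txt_proj)
              for p in m.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(params, lr=1e-3)

    img_tokens, txt_tokens = torch.randn(8, 49, 256), torch.randn(8, 32, 256)
    loss = clip_loss(img_proj(image_enc(img_tokens)), txt_proj(text_enc(txt_tokens)))
    loss.backward()
    opt.step()
    print(f"contrastive loss: {loss.item():.4f}")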
Cite
Text
Aczel and Wattenhofer. "Efficient Multimodal Alignment: To Freeze or Not to Freeze?" NeurIPS 2023 Workshops: UniReps, 2023.
Markdown
[Aczel and Wattenhofer. "Efficient Multimodal Alignment: To Freeze or Not to Freeze?" NeurIPS 2023 Workshops: UniReps, 2023.](https://mlanthology.org/neuripsw/2023/aczel2023neuripsw-efficient/)
BibTeX
@inproceedings{aczel2023neuripsw-efficient,
title = {{Efficient Multimodal Alignment: To Freeze or Not to Freeze?}},
author = {Aczel, Till and Wattenhofer, Roger},
booktitle = {NeurIPS 2023 Workshops: UniReps},
year = {2023},
url = {https://mlanthology.org/neuripsw/2023/aczel2023neuripsw-efficient/}
}