Visual Language Alignment Tuning
Abstract
Foundation models like CLIP are pivotal for advancing research in vision-language learning, as they simultaneously learn modality-specific representations and cross-modal alignment. However, training these models is resource-intensive, requiring hundreds of millions of image-text pairs and hundreds of GPUs, which creates a barrier to research on multimodal alignment. In this paper, we introduce the Swift Alignment of Image and Language (SAIL) framework, which focuses on vision-language alignment by tuning a lightweight alignment layer added on top of frozen pretrained single-modality models. SAIL drastically reduces computational demands, requiring only a single GPU to align the pretrained feature spaces.
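Since the abstract only describes the method at a high level, the following is a minimal PyTorch sketch of the idea it states: keep two pretrained single-modality encoders frozen and train only a lightweight alignment layer. The FrozenEncoder placeholder, the AlignmentLayer design, the embedding dimensions, and the CLIP-style symmetric contrastive (InfoNCE) objective are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch of training a lightweight alignment layer on top of frozen
# unimodal encoders. All module names, sizes, and the loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrozenEncoder(nn.Module):
    """Stand-in for a pretrained unimodal encoder (vision or text); its weights stay frozen."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.backbone = nn.Linear(in_dim, out_dim)  # placeholder for a real pretrained model
        for p in self.parameters():
            p.requires_grad = False

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            return self.backbone(x)


class AlignmentLayer(nn.Module):
    """Lightweight trainable head mapping frozen features into a shared embedding space."""

    def __init__(self, img_dim: int, txt_dim: int, shared_dim: int = 256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, shared_dim)
        self.txt_proj = nn.Linear(txt_dim, shared_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07), CLIP-style temperature

    def forward(self, img_feat, txt_feat):
        img_emb = F.normalize(self.img_proj(img_feat), dim=-1)
        txt_emb = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return img_emb, txt_emb, self.logit_scale.exp()


def contrastive_loss(img_emb, txt_emb, scale):
    """Symmetric InfoNCE over in-batch image-text pairs (assumed objective)."""
    logits = scale * img_emb @ txt_emb.t()
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    # Toy inputs standing in for images/captions fed to frozen pretrained encoders.
    vision, text = FrozenEncoder(512, 768), FrozenEncoder(384, 768)
    head = AlignmentLayer(img_dim=768, txt_dim=768)
    opt = torch.optim.AdamW(head.parameters(), lr=1e-3)  # only the alignment layer is updated

    images, captions = torch.randn(32, 512), torch.randn(32, 384)
    img_emb, txt_emb, scale = head(vision(images), text(captions))
    loss = contrastive_loss(img_emb, txt_emb, scale)
    loss.backward()
    opt.step()
    print(f"toy alignment loss: {loss.item():.4f}")
```

Because the encoders are frozen, only the small projection head and temperature receive gradients, which is what allows the alignment stage to fit on a single GPU in the setup the abstract describes.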
Cite

BibTeX
@inproceedings{zhang2024neuripsw-visual,
  title     = {{Visual Language Alignment Tuning}},
  author    = {Zhang, Le and Yang, Qian and Agrawal, Aishwarya},
  booktitle = {NeurIPS 2024 Workshops: AFM},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/zhang2024neuripsw-visual/}
}