Visual Language Alignment Tuning

Abstract

Foundation models like CLIP are pivotal for advancing research in vision-language learning, as they simultaneously learn modality-specific representations and cross-modal alignment. However, training these models is resource-intensive, requiring hundreds of millions of image-text pairs and hundreds of GPUs, which creates a high barrier to entry for research on multimodal alignment. In this paper, we introduce the **S**wift **A**lignment of **I**mage and **L**anguage (SAIL) framework, which achieves vision-language alignment by tuning a lightweight alignment layer on top of frozen pretrained single-modality models. SAIL drastically reduces computational demands, requiring only a single GPU to align the pretrained feature spaces.
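To make the abstract's idea concrete, below is a minimal PyTorch sketch of alignment tuning: features from frozen pretrained unimodal encoders are passed through small trainable projection heads and aligned with a contrastive objective. The linear projection, the InfoNCE-style loss, the random feature stand-ins, and all dimensions are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentHead(nn.Module):
    """Lightweight trainable projection into a shared embedding space
    (hypothetical architecture; the paper's alignment layer may differ)."""
    def __init__(self, in_dim: int, embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # L2-normalize so similarities are cosine similarities.
        return F.normalize(self.proj(feats), dim=-1)

def contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric CLIP-style InfoNCE loss over matched image-text pairs
    (assumed training objective)."""
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Stand-ins for features from frozen pretrained encoders; in practice these
# would come from a pretrained vision backbone and text encoder whose
# parameters are frozen (requires_grad=False), so only the heads train.
batch, d_vision, d_text, d_embed = 32, 1024, 768, 512
image_feats = torch.randn(batch, d_vision)  # frozen vision features
text_feats = torch.randn(batch, d_text)     # frozen text features

img_head = AlignmentHead(d_vision, d_embed)
txt_head = AlignmentHead(d_text, d_embed)
optimizer = torch.optim.AdamW(
    list(img_head.parameters()) + list(txt_head.parameters()), lr=1e-4)

# One training step: only the alignment heads receive gradients.
loss = contrastive_loss(img_head(image_feats), txt_head(text_feats))
loss.backward()
optimizer.step()
```

Because the encoders stay frozen and only the small heads are updated, the trainable parameter count and memory footprint are a tiny fraction of full CLIP-style pretraining, which is what makes single-GPU alignment plausible.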

Cite

Text

Zhang et al. "Visual Language Alignment Tuning." NeurIPS 2024 Workshops: AFM, 2024.

Markdown

[Zhang et al. "Visual Language Alignment Tuning." NeurIPS 2024 Workshops: AFM, 2024.](https://mlanthology.org/neuripsw/2024/zhang2024neuripsw-visual/)

BibTeX

@inproceedings{zhang2024neuripsw-visual,
  title     = {{Visual Language Alignment Tuning}},
  author    = {Zhang, Le and Yang, Qian and Agrawal, Aishwarya},
  booktitle = {NeurIPS 2024 Workshops: AFM},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/zhang2024neuripsw-visual/}
}