Introducing Routing Functions to Vision-Language Parameter-Efficient Fine-Tuning with Low-Rank Bottlenecks

Abstract

Mainstream parameter-efficient fine-tuning (PEFT) methods, such as LoRA or Adapter, project a model’s hidden states to a lower dimension, allowing pre-trained models to adapt to new data through this low-rank bottleneck. However, PEFT tasks involving multiple modalities, such as vision-language (VL) tasks, require not only adaptation to new data but also learning the relationship between the modalities. Targeting VL PEFT tasks, we propose a family of operations, called routing functions, to enhance VL alignment in the low-rank bottlenecks. These feature routing functions adopt linear operations and do not introduce new trainable parameters. In-depth analyses are conducted to study their behavior. In various VL PEFT settings, the routing functions significantly improve the performance of the original PEFT methods, achieving over 20% improvement on VQAv2 (RoBERTa-large+ViT-L/16) and 30% on COCO Captioning (GPT2-medium+ViT-L/16). Also, when fine-tuning a pre-trained multimodal model such as CLIP-BART, we observe smaller but consistent improvements across a range of VL PEFT tasks. Our code is available at https://github.com/tingyu215/Routing_VLPEFT.
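
The idea of a parameter-free routing step inside a low-rank bottleneck can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of a single pooled visual feature, the reuse of the down-projection for that feature, and the two routing choices shown (element-wise multiplication and addition) are all assumptions made for illustration.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def lora_with_routing(h_text, h_vis, W_down, W_up, route="mul"):
    """LoRA-style low-rank update with an optional routing step.

    h_text: hidden state of one text token (dimension d)
    h_vis:  pooled visual feature (dimension d); hypothetical input
    W_down: r x d down-projection; W_up: d x r up-projection
    route:  "mul", "add", or anything else for the plain low-rank path
    """
    z_text = matvec(W_down, h_text)  # project text state to rank r
    z_vis = matvec(W_down, h_vis)    # assumption: reuse W_down, so no new weights

    if route == "mul":
        # Element-wise product mixes the two modalities in the bottleneck.
        z = [a * b for a, b in zip(z_text, z_vis)]
    elif route == "add":
        z = [a + b for a, b in zip(z_text, z_vis)]
    else:
        z = z_text  # no routing: ordinary low-rank bottleneck

    return matvec(W_up, z)  # project back to dimension d


# Toy example with d = 4 and r = 2.
W_down = [[1, 0, 0, 0], [0, 1, 0, 0]]
W_up = [[1, 0], [0, 1], [0, 0], [0, 0]]
h_text = [1, 2, 3, 4]
h_vis = [2, 3, 4, 5]

out_mul = lora_with_routing(h_text, h_vis, W_down, W_up, route="mul")
out_add = lora_with_routing(h_text, h_vis, W_down, W_up, route="add")
out_plain = lora_with_routing(h_text, h_vis, W_down, W_up, route="none")
```

In this sketch the only trainable quantities are `W_down` and `W_up`, exactly as in the plain low-rank path; the routing step itself adds no weights, which matches the abstract's description of the routing functions as parameter-free.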

Cite

Text

Qu et al. "Introducing Routing Functions to Vision-Language Parameter-Efficient Fine-Tuning with Low-Rank Bottlenecks." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73223-2_17

Markdown

[Qu et al. "Introducing Routing Functions to Vision-Language Parameter-Efficient Fine-Tuning with Low-Rank Bottlenecks." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/qu2024eccv-introducing/) doi:10.1007/978-3-031-73223-2_17

BibTeX

@inproceedings{qu2024eccv-introducing,
  title     = {{Introducing Routing Functions to Vision-Language Parameter-Efficient Fine-Tuning with Low-Rank Bottlenecks}},
  author    = {Qu, Tingyu and Tuytelaars, Tinne and Moens, Marie-Francine},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-73223-2_17},
  url       = {https://mlanthology.org/eccv/2024/qu2024eccv-introducing/}
}