VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control

Abstract

As pre-trained language models (PLMs) grow rapidly in size, full fine-tuning becomes prohibitively expensive in both training and storage. In vision-and-language (VL), parameter-efficient tuning (PET) techniques have been proposed to integrate modular modifications (e.g., Adapter) into encoder-decoder PLMs. By tuning a small set of trainable parameters, these techniques perform on par with full fine-tuning. However, excessive modular modifications and neglect of the distinct abilities of the encoders and decoders can degrade performance, and existing PET techniques (e.g., VL-Adapter) overlook both issues. In this paper, we propose a Vision-and-Language Parameter-Efficient Tuning (VL-PET) framework that imposes effective control over modular modifications via a novel granularity-controlled mechanism. Depending on the granularity-controlled matrix generated by this mechanism, a variety of model-agnostic VL-PET modules can be instantiated from our framework for better efficiency and effectiveness trade-offs. We further propose lightweight designs that enhance VL alignment and modeling for the encoders and maintain text generation for the decoders. Extensive experiments on four image-text tasks and four video-text tasks demonstrate the efficiency, effectiveness, scalability, and transferability of our VL-PET framework. In particular, our VL-PET-large significantly outperforms full fine-tuning by 2.39% (2.61%) and VL-Adapter by 2.92% (3.41%) with BART-base (T5-base) on image-text tasks, while using fewer trainable parameters. Furthermore, we show that applying our VL-PET designs (e.g., the granularity-controlled mechanism and the lightweight designs) to existing PET techniques yields significant performance improvements.
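
The abstract does not spell out how a granularity-controlled matrix acts on a modular modification, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes a standard bottleneck adapter whose update is scaled element-wise by a gate matrix before being added back to the hidden states, and that the gate may have shape (1, 1), (n, 1), (1, d), or (n, d) for n tokens and d hidden dimensions. All function and parameter names (`gated_adapter`, `broadcast_gate`, `w_down`, `w_up`, `gate`) are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m: Matrix) -> Matrix:
    """Element-wise max(0, v)."""
    return [[max(0.0, v) for v in row] for row in m]


def broadcast_gate(gate: Matrix, n: int, d: int) -> Matrix:
    """Expand a gate of shape (1,1), (n,1), (1,d) or (n,d) to shape (n,d)."""
    out = []
    for i in range(n):
        row = gate[i] if len(gate) > 1 else gate[0]
        out.append([row[j] if len(row) > 1 else row[0] for j in range(d)])
    return out


def gated_adapter(h: Matrix, w_down: Matrix, w_up: Matrix, gate: Matrix) -> Matrix:
    """Assumed form: h + gate * adapter_update(h), with the gate broadcast to h's shape."""
    # Bottleneck adapter update: down-project, ReLU, up-project.
    delta = matmul(relu(matmul(h, w_down)), w_up)
    n, d = len(h), len(h[0])
    g = broadcast_gate(gate, n, d)
    # Scale the update by the gate, then add the residual connection.
    return [[h[i][j] + g[i][j] * delta[i][j] for j in range(d)] for i in range(n)]
```

In this sketch a (1, 1) gate scales the whole update uniformly, an (n, 1) gate controls it per token, a (1, d) gate per hidden dimension, and an (n, d) gate per element; a gate of all zeros leaves the hidden states unchanged.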

Cite

Text

Hu et al. "VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.00281

Markdown

[Hu et al. "VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/hu2023iccv-vlpet/) doi:10.1109/ICCV51070.2023.00281

BibTeX

@inproceedings{hu2023iccv-vlpet,
  title     = {{VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control}},
  author    = {Hu, Zi-Yuan and Li, Yanyang and Lyu, Michael R. and Wang, Liwei},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {3010-3020},
  doi       = {10.1109/ICCV51070.2023.00281},
  url       = {https://mlanthology.org/iccv/2023/hu2023iccv-vlpet/}
}