Mini but Mighty: Finetuning ViTs with Mini Adapters

Abstract

Vision Transformers (ViTs) have become one of the dominant architectures in computer vision, and pre-trained ViT models are commonly adapted to new tasks via fine-tuning. Recent works have proposed several parameter-efficient transfer learning methods, such as adapters, to avoid the prohibitive training and storage costs of full fine-tuning. In this work, we observe that adapters perform poorly when their hidden dimension is small, and we propose MiMi, a training framework that addresses this issue. We start with large adapters, which can reach high performance, and iteratively reduce the size of every adapter. We introduce a scoring function that compares neuron importance across layers and thereby allows automatic estimation of the hidden dimension of every adapter. Our method outperforms existing methods in finding the best trade-off between accuracy and trained parameters on three benchmarks, DomainNet, VTAB, and Multi-task, covering a total of 29 datasets. We will release our code publicly upon acceptance.
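To make the described pipeline concrete, below is a minimal PyTorch sketch, not the authors' released code: a standard bottleneck adapter, a hypothetical neuron-importance score (weight-magnitude based; the paper's exact scoring function may differ), and one global shrink step that ranks scores across layers so each adapter's hidden dimension is estimated automatically. The names (Adapter, neuron_scores, shrink) and the 25% pruning fraction are illustrative assumptions.

import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Standard bottleneck adapter inserted into a ViT block."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.down = nn.Linear(dim, hidden_dim)  # down-projection
        self.act = nn.GELU()
        self.up = nn.Linear(hidden_dim, dim)    # up-projection

    def forward(self, x):
        # Residual connection around the bottleneck
        return x + self.up(self.act(self.down(x)))

def neuron_scores(adapter: Adapter) -> torch.Tensor:
    # Hypothetical importance score: combine the magnitudes of each
    # hidden neuron's incoming (down) and outgoing (up) weights.
    in_norm = adapter.down.weight.norm(dim=1)   # shape: (hidden_dim,)
    out_norm = adapter.up.weight.norm(dim=0)    # shape: (hidden_dim,)
    s = in_norm * out_norm
    return s / s.sum()  # normalize so scores are comparable across layers

def shrink(adapter: Adapter, keep: torch.Tensor) -> Adapter:
    # Rebuild the adapter, keeping only the selected hidden neurons.
    new = Adapter(adapter.down.in_features, keep.numel())
    new.down.weight.data = adapter.down.weight.data[keep]
    new.down.bias.data = adapter.down.bias.data[keep]
    new.up.weight.data = adapter.up.weight.data[:, keep]
    new.up.bias.data = adapter.up.bias.data.clone()
    return new

# One shrink iteration: rank the hidden neurons of all adapters jointly
# and drop a global fraction, so layers that need less capacity end up
# with smaller adapters.
adapters = [Adapter(dim=768, hidden_dim=64) for _ in range(12)]
all_scores = torch.cat([neuron_scores(a) for a in adapters])
threshold = torch.quantile(all_scores, 0.25)  # drop lowest 25% globally
pruned = []
for a in adapters:
    keep = (neuron_scores(a) >= threshold).nonzero().squeeze(1)
    if keep.numel() == 0:  # keep at least one neuron per adapter
        keep = neuron_scores(a).argmax().unsqueeze(0)
    pruned.append(shrink(a, keep))
adapters = pruned

In the framework as described in the abstract, such a shrink step would alternate with fine-tuning: training starts from large adapters and iteratively reduces them until the desired accuracy/parameter trade-off is reached.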

Cite

Text

Marouf et al. "Mini but Mighty: Finetuning ViTs with Mini Adapters." Winter Conference on Applications of Computer Vision, 2024.

Markdown

[Marouf et al. "Mini but Mighty: Finetuning ViTs with Mini Adapters." Winter Conference on Applications of Computer Vision, 2024.](https://mlanthology.org/wacv/2024/marouf2024wacv-mini/)

BibTeX

@inproceedings{marouf2024wacv-mini,
  title     = {{Mini but Mighty: Finetuning ViTs with Mini Adapters}},
  author    = {Marouf, Imad Eddine and Tartaglione, Enzo and Lathuilière, Stéphane},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2024},
  pages     = {1732--1741},
  url       = {https://mlanthology.org/wacv/2024/marouf2024wacv-mini/}
}