Focus Your Attention When Few-Shot Classification

Abstract

Since many pre-trained vision transformers have emerged and provide strong representations for various downstream tasks, we aim to adapt them to few-shot image classification in this work. The input images typically contain multiple entities, and the model may not focus on the entities relevant to the classes of the current few-shot task, even after fine-tuning on the support samples; the noisy information from class-irrelevant entities then harms performance. To this end, we first propose a method that uses attention and gradient information to automatically locate the positions of the key entities in the support images, which we denote as position prompts. We then employ a cross-entropy loss between their many-hot representation and the attention logits to optimize the model to focus its attention on the key entities during fine-tuning. This ability can then generalize to the query samples. Our method is applicable to different vision transformers (e.g., columnar or pyramidal ones) and to different pre-training paradigms (e.g., single-modal or vision-language pre-training). Extensive experiments show that our method improves the performance of both full and parameter-efficient fine-tuning on few-shot tasks. Code is available at https://github.com/Haoqing-Wang/FORT.
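To make the objective described above concrete, below is a minimal PyTorch sketch of the attention-focusing loss: a cross-entropy between the many-hot position prompts and the class token's attention logits. The function names, the Grad-CAM-style attention-times-gradient scoring rule used to select key patches, and the `keep_ratio` parameter are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
import torch
import torch.nn.functional as F

def locate_key_patches(attn, grad, keep_ratio=0.2):
    """Sketch of locating position prompts (assumed selection rule).

    Scores each patch by attention * positive gradient (Grad-CAM style)
    and keeps the top `keep_ratio` fraction as key-entity positions.

    attn, grad: (B, N) class-token attention over N patches and its
                gradient w.r.t. the task loss.
    Returns a (B, N) many-hot mask over patch positions.
    """
    scores = attn * grad.clamp(min=0)
    k = max(1, int(keep_ratio * scores.size(-1)))
    topk = scores.topk(k, dim=-1).indices
    mask = torch.zeros_like(scores)
    mask.scatter_(-1, topk, 1.0)
    return mask

def attention_focus_loss(attn_logits, key_mask):
    """Cross-entropy between the many-hot position prompts and the
    class token's pre-softmax attention logits.

    attn_logits: (B, N) attention logits of the class token.
    key_mask:    (B, N) many-hot mask from locate_key_patches.
    """
    # Normalize the many-hot mask into a target distribution so the
    # soft-target cross-entropy spreads attention over all key patches.
    target = key_mask / key_mask.sum(dim=-1, keepdim=True).clamp(min=1)
    log_probs = F.log_softmax(attn_logits, dim=-1)
    return -(target * log_probs).sum(dim=-1).mean()
```

During fine-tuning, this loss would be added to the standard classification loss on the support samples, encouraging the attention maps to concentrate on the located key entities.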

Cite

Text

Wang et al. "Focus Your Attention When Few-Shot Classification." Neural Information Processing Systems, 2023.

Markdown

[Wang et al. "Focus Your Attention When Few-Shot Classification." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/wang2023neurips-focus-a/)

BibTeX

@inproceedings{wang2023neurips-focus-a,
  title     = {{Focus Your Attention When Few-Shot Classification}},
  author    = {Wang, Haoqing and Jie, Shibo and Deng, Zhihong},
  booktitle = {Neural Information Processing Systems},
  year      = {2023},
  url       = {https://mlanthology.org/neurips/2023/wang2023neurips-focus-a/}
}