Fine-Grained Visual Recognition in the Age of Multimodal LLMs
Abstract
Fine-Grained Visual Recognition (FGVR) involves differentiating between visually similar categories and is challenging due to the subtle differences between categories and the need for large, expert-annotated datasets. We observe that recent Multimodal Large Language Models (MLLMs) demonstrate potential in FGVR, but querying such models for every test input is impractical due to high cost and latency. To address this, we propose a novel pipeline that fine-tunes a CLIP model for FGVR by leveraging MLLMs. Our approach requires only a small support set of unlabeled images to construct a weakly supervised dataset, with MLLMs as label generators. To mitigate the impact of the noisy labels thus obtained, we construct a candidate set for each image using the labels of its neighboring images, thereby increasing the likelihood that the correct label is included in the candidate set. We then employ a partial label learning algorithm to fine-tune a CLIP model using these candidate sets. Our method sets a new benchmark for efficient fine-grained classification, achieving performance comparable to MLLMs at just $1/100^{th}$ of the inference cost and a fraction of the inference time.
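To make the pipeline concrete, below is a minimal sketch (not the authors' code) of the two steps the abstract describes: building candidate label sets from neighboring images, and a partial-label objective for fine-tuning. It assumes MLLM-generated labels and image embeddings are already available as arrays; the function names, the choice of k, and the specific loss are illustrative assumptions, since the abstract does not specify the exact partial label learning algorithm used.

```python
# Illustrative sketch, not the paper's implementation.
# Assumes: `embeddings` are per-image feature vectors (e.g., from CLIP's
# image encoder) and `mllm_labels` are the noisy labels an MLLM produced
# for each support-set image.
import numpy as np
import torch
import torch.nn.functional as F

def build_candidate_sets(embeddings: np.ndarray, mllm_labels: list, k: int = 5):
    """For each image, collect the MLLM-generated labels of its k nearest
    neighbors (cosine similarity in embedding space) into a candidate set,
    raising the chance that the true label is included despite label noise."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = x @ x.T
    candidate_sets = []
    for i in range(len(mllm_labels)):
        # Top k+1 most similar rows; the image itself (similarity 1) is included.
        neighbors = np.argsort(-sims[i])[: k + 1]
        candidate_sets.append({mllm_labels[j] for j in neighbors})
    return candidate_sets

def partial_label_loss(logits: torch.Tensor, candidate_mask: torch.Tensor):
    """One common partial-label surrogate (a hypothetical stand-in for the
    paper's algorithm): minimize the negative log of the total probability
    mass the model assigns to the candidate set.
    `candidate_mask` is a (batch, num_classes) 0/1 tensor marking candidates."""
    probs = F.softmax(logits, dim=1)
    cand_prob = (probs * candidate_mask).sum(dim=1).clamp_min(1e-8)
    return -torch.log(cand_prob).mean()
```

In this reading, the expensive MLLM is queried once per support-set image to seed the labels, after which all test-time inference runs through the fine-tuned CLIP model, which is where the reported cost savings would come from.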
Cite
Text
Kuchibhotla et al. "Fine-Grained Visual Recognition in the Age of Multimodal LLMs." NeurIPS 2024 Workshops: AFM, 2024.
Markdown
[Kuchibhotla et al. "Fine-Grained Visual Recognition in the Age of Multimodal LLMs." NeurIPS 2024 Workshops: AFM, 2024.](https://mlanthology.org/neuripsw/2024/kuchibhotla2024neuripsw-finegrained/)
BibTeX
@inproceedings{kuchibhotla2024neuripsw-finegrained,
  title = {{Fine-Grained Visual Recognition in the Age of Multimodal LLMs}},
  author = {Kuchibhotla, Hari Chandana and Reddy, Abbavaram Gowtham and Kancheti, Sai Srinivas and Balasubramanian, Vineeth N.},
  booktitle = {NeurIPS 2024 Workshops: AFM},
  year = {2024},
  url = {https://mlanthology.org/neuripsw/2024/kuchibhotla2024neuripsw-finegrained/}
}