Multi-Modal Large Language Models Are Effective Vision Learners
Abstract
Large language models (LLMs) pre-trained on vast amounts of text have shown remarkable abilities in understanding general knowledge and commonsense. It is therefore desirable to leverage pre-trained LLMs to help solve computer vision tasks. Previous work on multi-modal LLMs has mainly focused on generation capability. In this work, we propose LLM-augmented visual representation learning (LMVR). Our approach first uses a vision encoder to extract features, which are then projected into the word embedding space of the LLM. The LLM then generates responses based on the visual representation and a text prompt. Finally, we aggregate sequence-level features from the hidden layers of the LLM to obtain image-level representations. We conduct extensive experiments on multiple datasets and report the following findings: (a) LMVR outperforms traditional vision encoders on various downstream tasks and effectively learns the correspondence between words and image regions; (b) LMVR improves generalizability compared to using a vision encoder alone, as evidenced by its superior resistance to domain shift; (c) LMVR improves the robustness of models to corrupted and perturbed visual data. Our findings demonstrate that LLM-augmented visual representation learning is effective, as it learns object-level concepts and commonsense knowledge.
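The data flow described in the abstract (vision features, projection into the word embedding space, LLM hidden states, aggregation into one image-level vector) can be illustrated with a toy sketch. This is not the authors' implementation. The abstract does not specify the aggregation operator or which hidden layers are used, so mean pooling over the last layer is an assumption here. The function `toy_llm_hidden_states` is a hypothetical stand-in for a real LLM, all dimensions and random inputs are made up for illustration, and the response-generation step is omitted.

```python
import random


def matmul(x, w):
    """Multiply an (n x d_in) list-of-lists by a (d_in x d_out) list-of-lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def mean_pool(seq):
    """Average a sequence of vectors (n x d) into a single d-dimensional vector."""
    n = len(seq)
    return [sum(col) / n for col in zip(*seq)]


def toy_llm_hidden_states(token_embeddings, num_layers=2):
    """Hypothetical stand-in for an LLM: returns one (n x d) hidden state per layer.

    Each toy "layer" mixes every token with the mean of the sequence.
    A real LLM would apply transformer blocks here.
    """
    states = []
    h = token_embeddings
    for _ in range(num_layers):
        ctx = mean_pool(h)
        h = [[0.5 * v + 0.5 * c for v, c in zip(tok, ctx)] for tok in h]
        states.append(h)
    return states


def image_representation(patch_features, projection, prompt_embeddings):
    """Map vision-encoder patch features to one image-level vector."""
    # Project visual features into the (toy) word embedding space.
    visual_tokens = matmul(patch_features, projection)
    # Feed visual tokens followed by the text-prompt tokens to the LLM stand-in.
    sequence = visual_tokens + prompt_embeddings
    hidden = toy_llm_hidden_states(sequence)
    # Assumption: aggregate by mean-pooling the last layer over the sequence.
    return mean_pool(hidden[-1])


# Made-up sizes and random inputs, for illustration only.
rng = random.Random(0)
d_vis, d_llm, n_patches, n_prompt = 4, 6, 5, 3
patches = [[rng.gauss(0, 1) for _ in range(d_vis)] for _ in range(n_patches)]
proj = [[rng.gauss(0, 1) for _ in range(d_llm)] for _ in range(d_vis)]
prompt = [[rng.gauss(0, 1) for _ in range(d_llm)] for _ in range(n_prompt)]

rep = image_representation(patches, proj, prompt)
print(len(rep))  # one vector with the LLM embedding dimension
```

In practice the resulting vector would be passed to a downstream head such as a linear classifier. Other aggregation choices (for example, pooling several layers or only the visual token positions) fit the same interface.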
Cite
Text
Sun et al. "Multi-Modal Large Language Models Are Effective Vision Learners." Winter Conference on Applications of Computer Vision, 2025.
Markdown
[Sun et al. "Multi-Modal Large Language Models Are Effective Vision Learners." Winter Conference on Applications of Computer Vision, 2025.](https://mlanthology.org/wacv/2025/sun2025wacv-multimodal/)
BibTeX
@inproceedings{sun2025wacv-multimodal,
title = {{Multi-Modal Large Language Models Are Effective Vision Learners}},
author = {Sun, Li and Ahuja, Chaitanya and Chen, Peng and D'Zmura, Matt and Batmanghelich, Kayhan and Bontrager, Philip},
booktitle = {Winter Conference on Applications of Computer Vision},
year = {2025},
pages = {8606--8615},
url = {https://mlanthology.org/wacv/2025/sun2025wacv-multimodal/}
}