MMICL: Empowering Vision-Language Model with Multi-Modal In-Context Learning

Haozhe Zhao, Zefan Cai, Shuzheng Si, Xiaojian Ma, Kaikai An, Liang Chen, Zixuan Liu, Sheng Wang, Wenjuan Han, Baobao Chang

ICLR 2024

/iclr/2024/zhao2024iclr-mmicl/

Abstract

Since the resurgence of deep learning, vision-language models (VLMs) enhanced by large language models (LLMs) have grown exponentially in popularity. However, while LLMs can utilize extensive background knowledge and task information with in-context learning, most VLMs still struggle with understanding complex multi-modal prompts with multiple images, making VLMs less effective in downstream vision-language tasks. In this paper, we address the limitation above by 1) introducing vision-language Model with **M**ulti-**M**odal **I**n-**C**ontext **L**earning(MMICL), a new approach to allow the VLM to deal with multi-modal inputs efficiently; 2) proposing a novel context scheme to augment the in-context learning ability of the VLM; 3) constructing the Multi-modal In-Context Learning (MIC) dataset, designed to enhance the VLM's ability to understand complex multi-modal prompts. Our experiments confirm that MMICL achieves new state-of-the-art zero-shot performance on a wide range of general vision-language tasks, especially for complex benchmarks, including MME and MMBench. Our analysis demonstrates that MMICL effectively tackles the challenge of complex multi-modal prompt understanding and emerges the impressive ICL ability. Furthermore, we observe that MMICL successfully alleviates language bias in VLMs, a common issue for VLMs that often leads to hallucination when faced with extensive textual context. Our code, dataset, dataset tool, and model are available at https://github.com/PKUnlp-icler/MIC.

PDF ICLR Semantic Scholar

Cite

Text

Zhao et al. "MMICL: Empowering Vision-Language Model with Multi-Modal In-Context Learning." International Conference on Learning Representations, 2024.

Markdown

[Zhao et al. "MMICL: Empowering Vision-Language Model with Multi-Modal In-Context Learning." International Conference on Learning Representations, 2024.](https://mlanthology.org/iclr/2024/zhao2024iclr-mmicl/)

BibTeX

@inproceedings{zhao2024iclr-mmicl,
  title     = {{MMICL: Empowering Vision-Language Model with Multi-Modal In-Context Learning}},
  author    = {Zhao, Haozhe and Cai, Zefan and Si, Shuzheng and Ma, Xiaojian and An, Kaikai and Chen, Liang and Liu, Zixuan and Wang, Sheng and Han, Wenjuan and Chang, Baobao},
  booktitle = {International Conference on Learning Representations},
  year      = {2024},
  url       = {https://mlanthology.org/iclr/2024/zhao2024iclr-mmicl/}
}