What Factors Affect Multi-Modal In-Context Learning? An In-Depth Exploration

Abstract

Recently, rapid advances in Multi-Modal In-Context Learning (MM-ICL) have achieved notable success, enabling superior performance across various tasks without additional parameter tuning. However, the underlying rules governing the effectiveness of MM-ICL remain under-explored. To fill this gap, this work aims to investigate the research question: "What factors affect the performance of MM-ICL?" To this end, we conduct extensive experiments on the three core steps of MM-ICL (demonstration retrieval, demonstration ordering, and prompt construction) using 6 vision large language models and 20 strategies. Our findings highlight (1) the necessity of a multi-modal retriever for demonstration retrieval, (2) the importance of intra-demonstration ordering over inter-demonstration ordering, and (3) the enhancement of task comprehension through introductory instructions in prompts. We hope this study can serve as a foundational guide for optimizing MM-ICL strategies in future research.
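For illustration only, below is a minimal Python sketch (not from the paper) of the three MM-ICL steps the abstract refers to: multi-modal demonstration retrieval, demonstration ordering, and prompt construction with an introductory instruction. The demonstration pool, the precomputed `img_emb`/`txt_emb` embeddings, and the similarity-based ordering heuristic are all hypothetical placeholders for whatever retriever and ordering strategy one actually uses.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two 1-D embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve_demonstrations(query: dict, pool: list[dict], k: int = 4) -> list[dict]:
    # Step 1: multi-modal retrieval. Score each candidate by the sum of its
    # image-image and text-text similarity to the query (a simple joint score).
    scores = [
        cosine(query["img_emb"], d["img_emb"]) + cosine(query["txt_emb"], d["txt_emb"])
        for d in pool
    ]
    top = np.argsort(scores)[::-1][:k]
    return [pool[i] for i in top]

def order_demonstrations(demos: list[dict], query: dict) -> list[dict]:
    # Step 2: inter-demonstration ordering, here least-to-most similar to the query
    # so the most relevant demonstration sits closest to the test input.
    return sorted(demos, key=lambda d: cosine(query["txt_emb"], d["txt_emb"]))

def build_prompt(demos: list[dict], query: dict, instruction: str) -> str:
    # Step 3: prompt construction. An introductory instruction comes first, and
    # each demonstration follows a fixed intra-demonstration order:
    # image placeholder -> question -> answer.
    parts = [instruction]
    for d in demos:
        parts.append(f"<image:{d['image_id']}>\nQ: {d['question']}\nA: {d['answer']}")
    parts.append(f"<image:{query['image_id']}>\nQ: {query['question']}\nA:")
    return "\n\n".join(parts)
```

The fixed image-question-answer layout inside each demonstration reflects the paper's finding that intra-demonstration ordering matters more than how the demonstrations are arranged relative to one another; the specific inter-demonstration heuristic above is just one reasonable choice.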

Cite

Text

Qin et al. "What Factors Affect Multi-Modal In-Context Learning? An In-Depth Exploration." Neural Information Processing Systems, 2024. doi:10.52202/079017-3916

Markdown

[Qin et al. "What Factors Affect Multi-Modal In-Context Learning? An In-Depth Exploration." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/qin2024neurips-factors/) doi:10.52202/079017-3916

BibTeX

@inproceedings{qin2024neurips-factors,
  title     = {{What Factors Affect Multi-Modal In-Context Learning? An In-Depth Exploration}},
  author    = {Qin, Libo and Chen, Qiguang and Fei, Hao and Chen, Zhi and Li, Min and Che, Wanxiang},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-3916},
  url       = {https://mlanthology.org/neurips/2024/qin2024neurips-factors/}
}