What Makes Multimodal In-Context Learning Work?
Abstract
Large Language Models have demonstrated remarkable performance across various tasks, including the capacity to swiftly acquire new skills from only a few demonstration examples through In-Context Learning (ICL). In this work, we present a comprehensive framework for investigating Multimodal ICL (M-ICL) in the context of Large Multimodal Models. We consider the best open-source multimodal models (e.g., IDEFICS, OpenFlamingo) and a wide range of multimodal tasks. Our study unveils several noteworthy findings: (1) M-ICL primarily relies on text-driven mechanisms, showing little to no influence from the image modality. (2) When used with an advanced ICL strategy (such as RICES), M-ICL does not outperform a simple baseline based on majority voting over the context examples. Moreover, we identify several biases and limitations of M-ICL that warrant consideration prior to deployment. Code available at gitlab.com/folbaeni/multimodal-icl
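To make the comparison in finding (2) concrete, below is a minimal sketch of the two context-selection strategies the abstract contrasts: RICES-style retrieval of in-context examples by image-embedding similarity, and the majority-voting baseline that simply predicts the most frequent label among the retrieved examples. Function names, embedding dimensions, and the toy data are illustrative assumptions, not taken from the paper's released code.

```python
import numpy as np

def rices_retrieve(query_embedding, support_embeddings, k=8):
    """RICES-style selection: pick the k support examples whose image
    embeddings are most similar (cosine) to the query image embedding."""
    sims = support_embeddings @ query_embedding
    sims /= (np.linalg.norm(support_embeddings, axis=1)
             * np.linalg.norm(query_embedding) + 1e-8)
    return np.argsort(-sims)[:k]

def majority_vote(labels):
    """The simple baseline: predict the most frequent label among the
    retrieved in-context examples, without querying the model at all."""
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)]

# Toy usage: 100 support examples with 512-d image embeddings and class labels.
rng = np.random.default_rng(0)
support_emb = rng.normal(size=(100, 512))
support_labels = rng.integers(0, 5, size=100)
query_emb = rng.normal(size=512)

idx = rices_retrieve(query_emb, support_emb, k=8)
print("Majority-vote prediction:", majority_vote(support_labels[idx]))
```

The point of the baseline is that it uses only the labels of the retrieved demonstrations; if a multimodal model conditioned on those same demonstrations performs no better, the retrieval step, rather than multimodal reasoning, is doing most of the work.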
Cite
Text
Baldassini et al. "What Makes Multimodal In-Context Learning Work?." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00161
Markdown
[Baldassini et al. "What Makes Multimodal In-Context Learning Work?." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/baldassini2024cvprw-makes/) doi:10.1109/CVPRW63382.2024.00161
BibTeX
@inproceedings{baldassini2024cvprw-makes,
title = {{What Makes Multimodal In-Context Learning Work?}},
author = {Baldassini, Folco Bertini and Shukor, Mustafa and Cord, Matthieu and Soulier, Laure and Piwowarski, Benjamin},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
year = {2024},
pages = {1539--1550},
doi = {10.1109/CVPRW63382.2024.00161},
url = {https://mlanthology.org/cvprw/2024/baldassini2024cvprw-makes/}
}