Generative Multimodal Models Are In-Context Learners

Abstract

Humans can easily solve multimodal tasks in context with only a few demonstrations or simple instructions, which current multimodal systems largely struggle to imitate. In this work, we demonstrate that by effectively scaling up generative multimodal models, their task-agnostic in-context learning capabilities can be significantly enhanced. We introduce Emu2, a generative multimodal model with 37 billion parameters, which serves as a base model and general-purpose interface for a variety of multimodal tasks. Emu2 not only achieves strong performance in few-shot settings, but can also be instruct-tuned to follow specific instructions, such as visual question answering and object-grounded image generation. Emu2 even demonstrates emergent abilities on tasks that require on-the-fly reasoning, such as visual prompting, which existing models are unlikely to handle. We identify additional tasks where Emu2's in-context learning can further improve, and discuss its broader societal impact. Our code and models will be made publicly available to facilitate future research.

Cite

Text

Sun et al. "Generative Multimodal Models Are In-Context Learners." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01365

Markdown

[Sun et al. "Generative Multimodal Models Are In-Context Learners." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/sun2024cvpr-generative/) doi:10.1109/CVPR52733.2024.01365

BibTeX

@inproceedings{sun2024cvpr-generative,
  title     = {{Generative Multimodal Models Are In-Context Learners}},
  author    = {Sun, Quan and Cui, Yufeng and Zhang, Xiaosong and Zhang, Fan and Yu, Qiying and Wang, Yueze and Rao, Yongming and Liu, Jingjing and Huang, Tiejun and Wang, Xinlong},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {14398--14409},
  doi       = {10.1109/CVPR52733.2024.01365},
  url       = {https://mlanthology.org/cvpr/2024/sun2024cvpr-generative/}
}