MM1: Methods, Analysis & Insights from Multimodal LLM Pre-Training

Abstract

In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision-language connector, and various pre-training data choices, we identify several crucial design lessons. For example, we demonstrate that, for large-scale multimodal pre-training, a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published multimodal pre-training results. Further, we show that the image encoder, together with the image resolution and the image token count, has a substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models, including both dense variants up to 30B and mixture-of-experts (MoE) variants up to 64B, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning and multi-image reasoning, enabling few-shot chain-of-thought prompting.
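
The architectural recipe the abstract describes (an image encoder, a lightweight vision-language connector that fixes the image token count, and a decoder-only LLM) can be sketched in a few lines. Below is a minimal PyTorch-style illustration, assuming a generic transformer image encoder and a simple average-pooling connector; all module names, dimensions, and the pooling choice are illustrative stand-ins, not MM1's actual implementation.

```python
import torch
import torch.nn as nn


class AvgPoolConnector(nn.Module):
    """Pools a variable number of encoder tokens down to a fixed
    image-token budget, then projects into the LLM embedding space.
    Illustrative only; MM1 ablates several connector designs and
    finds the choice comparatively unimportant."""

    def __init__(self, vis_dim: int, llm_dim: int, num_image_tokens: int = 64):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(num_image_tokens)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, vis_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (B, N_v, vis_dim) -> (B, num_image_tokens, llm_dim)
        x = self.pool(vis_tokens.transpose(1, 2)).transpose(1, 2)
        return self.proj(x)


class ToyMLLM(nn.Module):
    """Image encoder -> connector -> decoder-only LM. The encoder and
    LM here are tiny stand-ins, not the models used in the paper."""

    def __init__(self, vis_dim=1024, llm_dim=2048, vocab=32000):
        super().__init__()
        self.image_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(vis_dim, 8, batch_first=True), 2)
        self.connector = AvgPoolConnector(vis_dim, llm_dim)
        self.tok_embed = nn.Embedding(vocab, llm_dim)
        # Causal attention mask omitted for brevity.
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(llm_dim, 16, batch_first=True), 2)
        self.lm_head = nn.Linear(llm_dim, vocab)

    def forward(self, patch_tokens, text_ids):
        img = self.connector(self.image_encoder(patch_tokens))
        txt = self.tok_embed(text_ids)
        seq = torch.cat([img, txt], dim=1)  # image tokens prefix the text
        return self.lm_head(self.llm(seq))


model = ToyMLLM()
patches = torch.randn(2, 196, 1024)         # e.g. 14x14 ViT patch tokens
text = torch.randint(0, 32000, (2, 32))
logits = model(patches, text)               # (2, 64 + 32, 32000)
```

The connector here only pools and projects, which mirrors the paper's finding that image resolution and the number of image tokens matter far more than the connector's internal design.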

Cite

Text

McKinzie et al. "MM1: Methods, Analysis & Insights from Multimodal LLM Pre-Training." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73397-0_18

Markdown

[McKinzie et al. "MM1: Methods, Analysis & Insights from Multimodal LLM Pre-Training." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/mckinzie2024eccv-mm1/) doi:10.1007/978-3-031-73397-0_18

BibTeX

@inproceedings{mckinzie2024eccv-mm1,
  title     = {{MM1: Methods, Analysis \& Insights from Multimodal LLM Pre-Training}},
  author    = {McKinzie, Brandon and Gan, Zhe and Fauconnier, Jean-Philippe and Dodge, Samuel and Zhang, Bowen and Dufter, Philipp and Shah, Dhruti and Peng, Futang and Belyi, Anton and Schwarzer, Max A. and Hè, Hongyu and Du, Xianzhi and Zhang, Haotian and Singh, Karanjeet and Kang, Doug and Gunter, Tom and Kong, Xiang and Zhang, Aonan and Wang, Jianyu and Wang, Chong and Du, Nan and Lei, Tao and Wiseman, Sam and Lee, Mark and Wang, Zirui and Pang, Ruoming and Grasch, Peter and Toshev, Alexander and Yang, Yinfei},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-73397-0_18},
  url       = {https://mlanthology.org/eccv/2024/mckinzie2024eccv-mm1/}
}