Parrot: Multilingual Visual Instruction Tuning
Abstract
The rapid development of Multimodal Large Language Models (MLLMs), such as GPT-4, marks a significant step toward artificial general intelligence. Existing methods typically align vision encoders with LLMs via supervised fine-tuning (SFT), but this often degrades the model's ability to handle multiple languages as training progresses. We empirically observe that imbalanced, largely English-centric SFT datasets hurt performance on non-English languages because multilingual tokens fail to align properly. To address this, we propose Parrot, a novel approach that uses textual guidance to align visual tokens at the language level. Parrot conditions visual tokens on diverse language inputs and uses a Mixture-of-Experts (MoE) module to align multilingual tokens: cross-attention between the initial visual features and the textual embeddings selects the most relevant experts, which convert the visual tokens into language-specific representations. In addition, we introduce the Massive Multilingual Multimodal Benchmark (MMMB), a new benchmark comprising 6 languages, 15 categories, and 12,000 questions, to assess multilingual capabilities. Parrot achieves state-of-the-art performance on multilingual benchmarks and on a wide range of multimodal tasks. Code and dataset are available at: https://github.com/AIDC-AI/Parrot.
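The alignment mechanism described in the abstract can be made concrete with a short sketch. The following is a minimal, hypothetical PyTorch rendering of language-conditioned visual token alignment in the spirit of Parrot, not the authors' released code: the module name LanguageMoEAligner, the dimensions, the mean-pooled routing, and the MLP experts are all illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageMoEAligner(nn.Module):
    """Hypothetical sketch: route visual tokens through language experts."""

    def __init__(self, dim: int = 1024, num_experts: int = 6):
        super().__init__()
        # Cross-attention: visual tokens attend to the textual embeddings.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Router scores each language expert from the attended features.
        self.router = nn.Linear(dim, num_experts)
        # One lightweight MLP expert per supported language (assumed design).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, visual: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # visual: (B, Nv, dim) initial visual features; text: (B, Nt, dim).
        attended, _ = self.cross_attn(query=visual, key=text, value=text)
        # Pool over tokens, then softmax to get per-expert routing weights.
        gate = F.softmax(self.router(attended.mean(dim=1)), dim=-1)  # (B, E)
        # Run every expert on the visual tokens: (B, E, Nv, dim).
        expert_out = torch.stack([e(visual) for e in self.experts], dim=1)
        # Gate-weighted sum yields language-specific visual representations.
        return torch.einsum("be,bend->bnd", gate, expert_out)

aligner = LanguageMoEAligner(dim=1024, num_experts=6)
visual = torch.randn(2, 576, 1024)  # e.g. ViT patch features
text = torch.randn(2, 32, 1024)     # embedded multilingual prompt
aligned = aligner(visual, text)     # (2, 576, 1024)

Here each expert stands in for one language, and the router's softmax weights blend expert outputs, so the same visual tokens are steered toward the language implied by the textual embeddings.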
Cite
Text
Sun et al. "Parrot: Multilingual Visual Instruction Tuning." Proceedings of the 42nd International Conference on Machine Learning, 2025.
Markdown
[Sun et al. "Parrot: Multilingual Visual Instruction Tuning." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/sun2025icml-parrot/)
BibTeX
@inproceedings{sun2025icml-parrot,
  title     = {{Parrot: Multilingual Visual Instruction Tuning}},
  author    = {Sun, Hai-Long and Zhou, Da-Wei and Li, Yang and Lu, Shiyin and Yi, Chao and Chen, Qing-Guo and Xu, Zhao and Luo, Weihua and Zhang, Kaifu and Zhan, De-Chuan and Ye, Han-Jia},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {57984--58007},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/sun2025icml-parrot/}
}