LRM-LLaVA: Overcoming the Modality Gap of Multilingual Large Language-Vision Model for Low-Resource Languages

Abstract

Multilingual large language-vision models (LVLMs), which understand and generate both text and images across multiple languages, have achieved remarkable performance on English-centric multimodal generation tasks. However, their performance on non-English tasks has been underwhelming. One major challenge for multilingual LVLMs is the modality gap between visual inputs and multilingual textual inputs/outputs, caused by the lack of high-quality multilingual training data. In this paper, we propose LRM-LLaVA, a multilingual large language-vision model designed for low-resource languages to overcome the modality gap. It is composed of four components: a visual encoder, a multilingual large language model, a vision-text representation projector, and a cross-modal regularizer. Both the projector and the regularizer aim to reduce the modality gap and improve multilingual performance. To train LRM-LLaVA, we employ a two-stage training strategy consisting of pre-training and instruction fine-tuning. We also construct a multilingual visual question answering dataset from English open-source datasets and adopt multiple task instructions. To evaluate the performance of LVLMs across various languages, we construct four multilingual benchmarks covering 10 languages, based on English open-source benchmarks. Experimental results show that LRM-LLaVA achieves competitive performance compared with other multilingual LVLMs of a similar parameter count.
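The abstract names four components but gives no implementation details here, so the following is only a minimal sketch of how a vision-text representation projector and a cross-modal regularizer are commonly realized (an MLP projector into the LLM embedding space and a contrastive alignment loss). All class names, dimensions, and the loss form are illustrative assumptions, not the authors' code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionTextProjector(nn.Module):
    """Illustrative MLP that maps visual encoder features into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.net(visual_feats)

def cross_modal_regularizer(image_emb: torch.Tensor, text_emb: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """Assumed contrastive loss pulling pooled image and text embeddings together."""
    image_emb = F.normalize(image_emb, dim=-1)   # (batch, dim)
    text_emb = F.normalize(text_emb, dim=-1)     # (batch, dim)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # symmetric image-to-text and text-to-image cross-entropy
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

if __name__ == "__main__":
    projector = VisionTextProjector(vision_dim=1024, llm_dim=4096)
    patches = torch.randn(2, 576, 1024)     # dummy ViT patch features
    projected = projector(patches)           # visual tokens fed to the multilingual LLM
    pooled_image = projected.mean(dim=1)     # simple pooling for the regularizer
    pooled_text = torch.randn(2, 4096)       # placeholder pooled text embeddings
    loss = cross_modal_regularizer(pooled_image, pooled_text)
    print(projected.shape, loss.item())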

Cite

Text

Li et al. "LRM-LLaVA: Overcoming the Modality Gap of Multilingual Large Language-Vision Model for Low-Resource Languages." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I23.34623

Markdown

[Li et al. "LRM-LLaVA: Overcoming the Modality Gap of Multilingual Large Language-Vision Model for Low-Resource Languages." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/li2025aaai-lrm/) doi:10.1609/AAAI.V39I23.34623

BibTeX

@inproceedings{li2025aaai-lrm,
  title     = {{LRM-LLaVA: Overcoming the Modality Gap of Multilingual Large Language-Vision Model for Low-Resource Languages}},
  author    = {Li, Junchen and Yang, Qing and Jiang, Bojian and Zhu, Shaolin and Sun, Qingxuan},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {24449--24457},
  doi       = {10.1609/AAAI.V39I23.34623},
  url       = {https://mlanthology.org/aaai/2025/li2025aaai-lrm/}
}