Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models
Abstract
Multimodal Large Language Models (MLLMs) are experiencing rapid growth, yielding a plethora of novel works recently. The prevailing trend involves adopting data-driven methodologies, wherein diverse instruction-following datasets are collected. However, these approaches consistently face the challenge of limited visual perception, as they rely solely on CLIP-like encoders to extract visual information from inputs. Although these encoders are pre-trained on billions of image-text pairs, they still grapple with the information loss dilemma, since textual captions only partially capture the contents depicted in images. To address this limitation, this paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism. Specifically, this work introduces a novel method that incorporates multi-task encoders and existing visual tools into the MLLM training and inference pipeline, aiming to provide a more comprehensive summarization of visual inputs. Extensive experiments demonstrate its effectiveness in advancing MLLMs, showcasing the improved visual perception capability achieved through the integration of visual experts.
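The core idea described in the abstract, summarizing an image with several complementary visual experts rather than a single CLIP-like encoder, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation; the expert functions (`clip_expert`, `ocr_expert`) and the fusion scheme are hypothetical stand-ins.

```python
# Hypothetical sketch: each "visual expert" produces a feature summary of
# the image, and the summaries are concatenated into one vector that would
# be projected into the language model's token sequence. The image is a
# toy list of floats; real experts would be neural encoders or tools.

def clip_expert(image):
    # Stand-in for a CLIP-like encoder's global feature (here: the mean).
    return [sum(image) / len(image)]

def ocr_expert(image):
    # Stand-in for an auxiliary tool (e.g. OCR) contributing extra detail
    # that a caption-trained encoder might miss (here: max and min).
    return [max(image), min(image)]

def fuse_experts(image, experts):
    """Concatenate every expert's feature summary into a single vector."""
    fused = []
    for expert in experts:
        fused.extend(expert(image))
    return fused

image = [0.1, 0.5, 0.9]
features = fuse_experts(image, [clip_expert, ocr_expert])
```

In this toy example `features` holds one value from the CLIP-like expert followed by two from the OCR stand-in; the key point is that adding experts enriches the visual summary passed to the language model without changing the fusion interface.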
Cite
Text
He et al. "Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models." International Joint Conference on Artificial Intelligence, 2025. doi:10.24963/IJCAI.2025/123
Markdown
[He et al. "Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models." International Joint Conference on Artificial Intelligence, 2025.](https://mlanthology.org/ijcai/2025/he2025ijcai-incorporating/) doi:10.24963/IJCAI.2025/123
BibTeX
@inproceedings{he2025ijcai-incorporating,
title = {{Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models}},
author = {He, Xin and Wei, Longhui and Xie, Lingxi and Tian, Qi},
booktitle = {International Joint Conference on Artificial Intelligence},
year = {2025},
pages = {1098-1106},
doi = {10.24963/IJCAI.2025/123},
url = {https://mlanthology.org/ijcai/2025/he2025ijcai-incorporating/}
}