Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models
Abstract
Multimodal Large Language Models (MLLMs) are experiencing rapid growth, yielding a plethora of novel works recently. The prevailing trend involves adopting data-driven methodologies, wherein diverse instruction-following datasets are collected. However, these approaches consistently face the challenge of limited visual perception, as they rely solely on CLIP-like encoders to extract visual information from inputs. Although these encoders are pre-trained on billions of image-text pairs, they still grapple with the information loss dilemma, since textual captions only partially capture the contents depicted in images. To address this limitation, this paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism. Specifically, this work introduces a novel method that incorporates multi-task encoders and existing visual tools into the MLLM training and inference pipeline, aiming to provide a more comprehensive summarization of visual inputs. Extensive experiments demonstrate its effectiveness in advancing MLLMs, showcasing the improved visual perception capability achieved through the integration of visual experts.
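The core idea described in the abstract, summarizing an image with several complementary visual experts rather than a single CLIP-like encoder, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation; the expert functions (`clip_expert`, `ocr_expert`) and the fusion scheme are hypothetical stand-ins.

```python
# Hypothetical sketch: each "visual expert" produces a feature summary of
# the image, and the summaries are concatenated into one vector that would
# be projected into the language model's token sequence. The image is a
# toy list of floats; real experts would be neural encoders or tools.

def clip_expert(image):
    # Stand-in for a CLIP-like encoder's global feature (here: the mean).
    return [sum(image) / len(image)]

def ocr_expert(image):
    # Stand-in for an auxiliary tool (e.g. OCR) contributing extra detail
    # that a caption-trained encoder might miss (here: max and min).
    return [max(image), min(image)]

def fuse_experts(image, experts):
    """Concatenate every expert's feature summary into a single vector."""
    fused = []
    for expert in experts:
        fused.extend(expert(image))
    return fused

image = [0.1, 0.5, 0.9]
features = fuse_experts(image, [clip_expert, ocr_expert])
```

In this toy example `features` holds one value from the CLIP-like expert followed by two from the OCR stand-in; the key point is that adding experts enriches the visual summary passed to the language model without changing the fusion interface.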
Cite
Text
He et al. "Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models." International Joint Conference on Artificial Intelligence, 2025. doi:10.24963/IJCAI.2025/123
Markdown
[He et al. "Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models." International Joint Conference on Artificial Intelligence, 2025.](https://mlanthology.org/ijcai/2025/he2025ijcai-incorporating/) doi:10.24963/IJCAI.2025/123
BibTeX
@inproceedings{he2025ijcai-incorporating,
title = {{Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models}},
author = {He, Xin and Wei, Longhui and Xie, Lingxi and Tian, Qi},
booktitle = {International Joint Conference on Artificial Intelligence},
year = {2025},
pages = {1098-1106},
doi = {10.24963/IJCAI.2025/123},
url = {https://mlanthology.org/ijcai/2025/he2025ijcai-incorporating/}
}