Words over Pixels? Rethinking Vision in Multimodal Large Language Models
Abstract
Multimodal Large Language Models (MLLMs) promise seamless integration of vision and language understanding. However, despite their strong performance, recent studies reveal that MLLMs often fail to effectively utilize visual information, frequently relying on textual cues instead. This survey provides a comprehensive analysis of the vision component in MLLMs, covering both application-level and architectural aspects. We investigate critical challenges such as weak spatial reasoning, poor fine-grained visual perception, and suboptimal fusion of visual and textual modalities. Additionally, we explore limitations in current vision encoders, benchmark inconsistencies, and their implications for downstream tasks. By synthesizing recent advancements, we highlight key research opportunities to enhance visual understanding, improve cross-modal alignment, and develop more robust and efficient MLLMs. Our observations emphasize the urgent need to elevate vision to an equal footing with language, paving the way for more reliable and perceptually aware multimodal models.
Cite
Text
Jain et al. "Words over Pixels? Rethinking Vision in Multimodal Large Language Models." International Joint Conference on Artificial Intelligence, 2025. doi:10.24963/IJCAI.2025/1164
Markdown
[Jain et al. "Words over Pixels? Rethinking Vision in Multimodal Large Language Models." International Joint Conference on Artificial Intelligence, 2025.](https://mlanthology.org/ijcai/2025/jain2025ijcai-words/) doi:10.24963/IJCAI.2025/1164
BibTeX
@inproceedings{jain2025ijcai-words,
title = {{Words over Pixels? Rethinking Vision in Multimodal Large Language Models}},
author = {Jain, Anubhooti and Vatsa, Mayank and Singh, Richa},
booktitle = {International Joint Conference on Artificial Intelligence},
year = {2025},
pages = {10481--10489},
doi = {10.24963/IJCAI.2025/1164},
url = {https://mlanthology.org/ijcai/2025/jain2025ijcai-words/}
}