Words over Pixels? Rethinking Vision in Multimodal Large Language Models

Abstract

Multimodal Large Language Models (MLLMs) promise seamless integration of vision and language understanding. However, despite their strong performance, recent studies reveal that MLLMs often fail to effectively utilize visual information, frequently relying on textual cues instead. This survey provides a comprehensive analysis of the vision component in MLLMs, covering both application-level and architectural aspects. We investigate critical challenges such as weak spatial reasoning, poor fine-grained visual perception, and suboptimal fusion of visual and textual modalities. Additionally, we explore limitations in current vision encoders, benchmark inconsistencies, and their implications for downstream tasks. By synthesizing recent advancements, we highlight key research opportunities to enhance visual understanding, improve cross-modal alignment, and develop more robust and efficient MLLMs. Our observations emphasize the urgent need to elevate vision to an equal footing with language, paving the path for more reliable and perceptually aware multimodal models.

Cite

Text

Jain et al. "Words over Pixels? Rethinking Vision in Multimodal Large Language Models." International Joint Conference on Artificial Intelligence, 2025. doi:10.24963/IJCAI.2025/1164

Markdown

[Jain et al. "Words over Pixels? Rethinking Vision in Multimodal Large Language Models." International Joint Conference on Artificial Intelligence, 2025.](https://mlanthology.org/ijcai/2025/jain2025ijcai-words/) doi:10.24963/IJCAI.2025/1164

BibTeX

@inproceedings{jain2025ijcai-words,
  title     = {{Words over Pixels? Rethinking Vision in Multimodal Large Language Models}},
  author    = {Jain, Anubhooti and Vatsa, Mayank and Singh, Richa},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {10481--10489},
  doi       = {10.24963/IJCAI.2025/1164},
  url       = {https://mlanthology.org/ijcai/2025/jain2025ijcai-words/}
}