Exploring Perceptual Limitations of Multimodal LLMs on Small Visual Objects
Abstract
Multimodal Large Language Models (MLLMs) have recently achieved remarkable performance on various multimodal benchmarks. However, general benchmarks often fail to reveal the specific limits of their visual perception because they lack controllability. In this work, we quantitatively study the perception of small visual objects in several widely used MLLMs and reveal a pervasive limitation in answering questions about small objects in images. We then conduct a controlled study of MLLMs' perception, using text reading as a surrogate task for general visual perception, to understand how the quality, size, distractors, and location of an object can independently affect the ability of MLLMs to perceive it in images. Through this controlled study, we find that lower object quality, smaller object size, and the presence of visual distractors can each independently reduce MLLMs' ability to answer visual questions. More surprisingly, even local perturbations of an object by a few pixels can cause a drastic decline in the ability of MLLMs to perceive it. Our study provides a better understanding of the perceptual limitations of MLLMs and contributes new evaluation protocols for analyzing and enhancing the perception of future MLLMs.
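As a rough illustration of what studying each factor "independently" could look like, the sketch below enumerates evaluation conditions by varying one factor at a time while holding the others at a baseline, and scores text-reading answers by exact match. This is not the authors' protocol: the `Condition` fields, the baseline and sweep values, and the helper names are all hypothetical, chosen only to make the idea concrete.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Condition:
    """One controlled setting for a text-reading probe (all fields hypothetical)."""
    font_px: int                 # rendered text height in pixels (object size)
    jpeg_quality: int            # compression level as a stand-in for object quality
    num_distractors: int         # number of extra text snippets placed in the image
    offset_px: Tuple[int, int]   # (dx, dy) shift of the target's location


# Assumed baseline; the values are illustrative, not taken from the paper.
BASELINE = Condition(font_px=32, jpeg_quality=95, num_distractors=0, offset_px=(0, 0))

# Assumed sweep values for each factor.
SWEEPS: Dict[str, list] = {
    "font_px": [8, 16, 32],
    "jpeg_quality": [10, 50, 95],
    "num_distractors": [0, 4, 8],
    "offset_px": [(0, 0), (3, 0), (0, 3)],
}


def one_factor_sweeps(baseline: Condition, sweeps: Dict[str, list]) -> List[Tuple[str, Condition]]:
    """Return (factor_name, condition) pairs where only that factor departs from baseline."""
    conditions = []
    for factor, values in sweeps.items():
        for value in values:
            conditions.append((factor, replace(baseline, **{factor: value})))
    return conditions


def exact_match_accuracy(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of predictions equal to the target string after trimming whitespace."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("predictions and targets must be non-empty and equal in length")
    hits = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return hits / len(targets)
```

Each condition would then be rendered into an image and shown to a model; comparing accuracy across the conditions of a single factor isolates that factor's effect, since everything else stays at the baseline.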
Cite
Text
Zhang et al. "Exploring Perceptual Limitations of Multimodal LLMs on Small Visual Objects." Transactions on Machine Learning Research, 2026.
Markdown
[Zhang et al. "Exploring Perceptual Limitations of Multimodal LLMs on Small Visual Objects." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/zhang2026tmlr-exploring/)
BibTeX
@article{zhang2026tmlr-exploring,
title = {{Exploring Perceptual Limitations of Multimodal LLMs on Small Visual Objects}},
author = {Zhang, Jiarui and Hu, Jinyi and Khayatkhoei, Mahyar and Ilievski, Filip and Sun, Maosong},
journal = {Transactions on Machine Learning Research},
year = {2026},
url = {https://mlanthology.org/tmlr/2026/zhang2026tmlr-exploring/}
}