Visual Structures Help Visual Reasoning: Addressing the Binding Problem in LVLMs
Abstract
Despite progress in Large Vision-Language Models (LVLMs), their capacity for visual reasoning is often limited by the binding problem: the failure to reliably associate perceptual features with their correct visual referents. This limitation underlies persistent errors in tasks such as counting, visual search, scene description, and spatial relationship understanding. A key factor is that current LVLMs process visual features largely in parallel and lack mechanisms for spatially grounded, serial attention. This paper introduces Visual Input Structure for Enhanced Reasoning (VISER), a simple, effective method that augments visual inputs with low-level spatial structures and pairs them with a textual prompt that encourages sequential, spatially aware parsing. We empirically demonstrate substantial performance improvements across core visual reasoning tasks using only a single inference query. Specifically, VISER improves GPT-4o performance on visual search, counting, and spatial relationship tasks by 25.0%, 26.8%, and 9.5%, respectively, and reduces edit-distance error in scene description by 0.32 on 2D datasets. Furthermore, we find that the visual modification is essential for these gains: purely textual strategies, including Chain-of-Thought prompting, are insufficient and can even degrade performance. VISER underscores the importance of visual input design over purely linguistic reasoning strategies and suggests that visual structuring is a powerful and general approach for enhancing compositional and spatial reasoning in LVLMs.
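As a rough illustration of the idea in the abstract, the sketch below overlays a low-level spatial structure on an image and builds a prompt that asks for sequential parsing. The abstract does not say which structures or prompt wording VISER uses, so the evenly spaced horizontal lines, the prompt text, and the names `add_horizontal_lines` and `build_prompt` are all assumptions made for this example, not the authors' implementation. The image is a plain nested list of grayscale values so the snippet runs without any imaging library.

```python
from typing import List, Tuple

# Grayscale pixels in row-major order; a stand-in for a real image (assumption).
Image = List[List[int]]


def add_horizontal_lines(
    image: Image, num_bands: int, line_value: int = 0
) -> Tuple[Image, List[Tuple[int, int]]]:
    """Return a copy of `image` split into `num_bands` bands by horizontal lines.

    Hypothetical helper: evenly spaced horizontal lines are one possible
    "low-level spatial structure"; the abstract does not specify the structure.
    Also returns the (start_row, end_row) range of each band, end exclusive.
    """
    height = len(image)
    if not 1 <= num_bands <= height:
        raise ValueError("num_bands must be between 1 and the image height")
    out = [row[:] for row in image]  # copy so the input is left untouched
    boundaries = [i * height // num_bands for i in range(num_bands + 1)]
    for y in boundaries[1:-1]:  # interior boundaries only
        out[y] = [line_value] * len(out[y])
    bands = list(zip(boundaries[:-1], boundaries[1:]))
    return out, bands


def build_prompt(question: str, num_bands: int) -> str:
    """Build a prompt that asks for band-by-band parsing (wording is an assumption)."""
    return (
        f"The image is divided into {num_bands} horizontal bands by lines. "
        "Scan the bands one at a time from top to bottom, "
        "note what each band contains, and then answer: " + question
    )


if __name__ == "__main__":
    blank = [[255] * 4 for _ in range(9)]  # 9x4 white image
    structured, bands = add_horizontal_lines(blank, num_bands=3)
    print(bands)
    print(build_prompt("How many circles are there?", 3))
```

In this sketch, the structured image and the prompt would be sent together to the model in one query, which matches the single-query setting the abstract describes.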
Cite
Text
Izadi et al. "Visual Structures Help Visual Reasoning: Addressing the Binding Problem in LVLMs." Advances in Neural Information Processing Systems, 2025.
Markdown
[Izadi et al. "Visual Structures Help Visual Reasoning: Addressing the Binding Problem in LVLMs." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/izadi2025neurips-visual/)
BibTeX
@inproceedings{izadi2025neurips-visual,
title = {{Visual Structures Help Visual Reasoning: Addressing the Binding Problem in LVLMs}},
author = {Izadi, Amirmohammad and Banayeeanzade, Mohammadali and Askari, Fatemeh and Rahimiakbar, Ali and Vahedi, Mohammad Mahdi and Hasani, Hosein and Baghshah, Mahdieh Soleymani},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/izadi2025neurips-visual/}
}