VisuRiddles: Fine-Grained Perception Is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning

Yan, Hao; Liu, Xingchen; Wang, Hao; Cao, Zhenbiao; Zheng, Handong; Yin, Liang; Su, Xinxing; Chen, Zihao; Wu, Jihao; Liao, Minghui; Weng, Chao; Chen, Wei; Liu, Yuliang; Bai, Xiang

ICLR 2026

Abstract

Multimodal large language models (MLLMs) have made significant progress on many reasoning tasks, but they still fail on Abstract Visual Reasoning (AVR) tasks. Our experimental findings indicate that the core bottleneck lies not only in the reasoning capabilities of MLLMs but, more critically, in their lack of fine-grained perception. To address this issue, we present VisuRiddles, a dedicated resource for AVR research. It consists of (i) a benchmark, collected from real-world data, for the systematic evaluation of MLLMs' AVR capabilities, and (ii) a synthesizer, which automatically generates AVR instances enriched with perceptual descriptions and reasoning chains, enabling supervised training and deeper investigation. Building on VisuRiddles, we propose a two-stage training paradigm that first enhances perceptual ability and then strengthens reasoning, producing the Perception-Augmented Visual Reasoner (PAVR). Experiments demonstrate that PAVR unifies perception and reasoning and substantially outperforms both open-source and commercial MLLMs, underscoring fine-grained perception as the primary bottleneck in AVR.

Cite

Text

Yan et al. "VisuRiddles: Fine-Grained Perception Is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning." International Conference on Learning Representations, 2026.

Markdown

[Yan et al. "VisuRiddles: Fine-Grained Perception Is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/yan2026iclr-visuriddles/)

BibTeX

@inproceedings{yan2026iclr-visuriddles,
  title     = {{VisuRiddles: Fine-Grained Perception Is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning}},
  author    = {Yan, Hao and Liu, Xingchen and Wang, Hao and Cao, Zhenbiao and Zheng, Handong and Yin, Liang and Su, Xinxing and Chen, Zihao and Wu, Jihao and Liao, Minghui and Weng, Chao and Chen, Wei and Liu, Yuliang and Bai, Xiang},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/yan2026iclr-visuriddles/}
}