From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning

Luo, Ruilin; Shi, Chufan; Zhang, Yizhen; Yang, Cheng; Jiang, Songtao; Guan, Tongkun; Chen, Ruizhe; Chu, Ruihang; Wang, Peng; Yang, Mingkun; Wang, Lei; Yang, Yujiu; Lin, Junyang; Yang, Zhibo

ICLR 2026

Abstract

The cold-start initialization stage plays a pivotal role in training Multimodal Large Reasoning Models (MLRMs), yet its mechanisms remain insufficiently understood. To analyze this stage, we introduce the Visual Attention Score (VAS), an attention-based metric that quantifies how much a model attends to visual tokens. We find that reasoning performance is strongly correlated with VAS (r=0.9616): models with higher VAS achieve substantially stronger multimodal reasoning. Surprisingly, multimodal cold-start fails to raise VAS, leaving distributions close to the base model, whereas text-only cold-start induces a clear increase. We term this counter-intuitive phenomenon Lazy Attention Localization. To validate its causal role, we design training-free interventions that directly manipulate attention allocation at inference time, yielding consistent 1–2% gains without retraining. Building on these insights, we propose Attention-Guided Visual Anchoring and Reflection (AVAR), a comprehensive cold-start framework that integrates visual-anchored data synthesis, attention-guided objectives, and visual-anchored reward shaping. Applied to Qwen2.5-VL-7B, AVAR delivers an average gain of 7.0% across 7 multimodal reasoning benchmarks. Ablation studies further confirm that each component of AVAR contributes step-wise to the overall gains. The code, data, and models are available at https://github.com/lrlbbzl/Qwen-AVAR.
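
As a rough illustration of the kind of quantity the abstract describes, the sketch below computes the share of attention mass that falls on visual tokens, and then applies a simple inference-time reweighting toward those tokens. This is not the paper's definition of VAS or of its interventions, neither of which is specified in the abstract. The function names `visual_attention_share` and `boost_visual_attention`, the scaling factor `alpha`, and the toy attention matrix are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Set


def visual_attention_share(attn: Sequence[Sequence[float]],
                           visual_idx: Set[int]) -> float:
    """Mean fraction of each query row's attention mass on visual tokens.

    ASSUMPTION: an illustrative proxy, not the paper's VAS formula.
    `attn[i][j]` is the weight query token i places on key token j.
    """
    shares: List[float] = []
    for row in attn:
        total = sum(row)
        if total == 0:
            continue  # skip rows that carry no attention mass
        visual = sum(w for j, w in enumerate(row) if j in visual_idx)
        shares.append(visual / total)
    return sum(shares) / len(shares) if shares else 0.0


def boost_visual_attention(row: Sequence[float],
                           visual_idx: Set[int],
                           alpha: float) -> List[float]:
    """Scale weights on visual tokens by `alpha`, then renormalize to sum to 1.

    ASSUMPTION: a generic training-free reweighting, not the paper's method.
    """
    scaled = [w * alpha if j in visual_idx else w for j, w in enumerate(row)]
    total = sum(scaled)
    return [w / total for w in scaled] if total else list(row)


# Toy example: 2 query tokens over 4 key tokens; keys 0 and 1 are "visual".
attn = [
    [0.1, 0.2, 0.3, 0.4],
    [0.25, 0.25, 0.25, 0.25],
]
visual_idx = {0, 1}

score = visual_attention_share(attn, visual_idx)
boosted = [boost_visual_attention(r, visual_idx, alpha=2.0) for r in attn]
boosted_score = visual_attention_share(boosted, visual_idx)
print(f"share before: {score:.3f}, after boosting: {boosted_score:.3f}")
```

In a real model the attention weights would come from specific layers and heads, and how those are selected and aggregated is a design choice that this sketch leaves open.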

Cite

Text

Luo et al. "From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning." International Conference on Learning Representations, 2026.

Markdown

[Luo et al. "From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/luo2026iclr-narrow/)

BibTeX

@inproceedings{luo2026iclr-narrow,
  title     = {{From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning}},
  author    = {Luo, Ruilin and Shi, Chufan and Zhang, Yizhen and Yang, Cheng and Jiang, Songtao and Guan, Tongkun and Chen, Ruizhe and Chu, Ruihang and Wang, Peng and Yang, Mingkun and Wang, Lei and Yang, Yujiu and Lin, Junyang and Yang, Zhibo},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/luo2026iclr-narrow/}
}