Reasoning-Aligned Perception Decoupling for Scalable Multi-Modal Reasoning

Gou, Yunhao; Chen, Kai; Liu, Zhili; Hong, Lanqing; Jin, Xin; Li, Zhenguo; Kwok, James; Zhang, Yu

ICLR 2026

Abstract

Recent breakthroughs in reasoning language models have significantly advanced text-based reasoning. Multi-modal Large Language Models (MLLMs), by contrast, still lag behind, held back by their outdated internal LLMs. Upgrading these LLMs is often prohibitively expensive because it requires retraining the vision-language alignment. To address this issue, we introduce Perception-Reasoning Decoupling, which modularizes the MLLM's reasoning component and makes it easily replaceable. This approach redefines the MLLM's role: it converts multi-modal inputs into detailed textual outputs that any powerful, external, text-only LLM reasoner can process. To align the MLLM's perceptual output with the final reasoning task, we propose a novel reinforcement learning algorithm called Visual Perception Optimization (VPO). VPO rewards the MLLM according to the correctness of the answers that the external reasoner generates from its captions, which encourages captions that are faithful and query-relevant. Together, this decoupling pipeline and VPO form our Reasoning-Aligned PerceptIon Decoupling (RAPID) approach. Empirical results show that RAPID achieves significant performance gains on multi-modal reasoning benchmarks. Crucially, RAPID enables a novel inference-time scaling paradigm: once trained with VPO, the MLLM can be paired with any state-of-the-art LLM reasoner for consistent performance improvement without retraining.
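
The decoupled pipeline and the reasoner-based reward described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `build_prompt`, `decoupled_answer`, and `caption_reward` are hypothetical, as are the prompt template, the exact-match correctness check, and the `num_samples` parameter. The captioner and reasoner are plain callables standing in for an MLLM and a text-only LLM.

```python
from typing import Any, Callable

# Hypothetical interfaces (assumptions, not from the paper):
# a captioner maps (image, question) to a textual description,
# a reasoner maps a text prompt to an answer string.
Captioner = Callable[[Any, str], str]
Reasoner = Callable[[str], str]


def build_prompt(caption: str, question: str) -> str:
    """Assumed prompt template that hands the caption to the text-only reasoner."""
    return f"Image description:\n{caption}\n\nQuestion: {question}"


def decoupled_answer(image: Any, question: str,
                     captioner: Captioner, reasoner: Reasoner) -> str:
    """Perception step (MLLM caption) followed by a text-only reasoning step."""
    caption = captioner(image, question)
    return reasoner(build_prompt(caption, question))


def caption_reward(caption: str, question: str, reference: str,
                   reasoner: Reasoner, num_samples: int = 4) -> float:
    """Fraction of sampled reasoner answers that match the reference answer.

    Assumption: correctness is a case-insensitive exact match; the reward
    actually used by VPO may be defined differently.
    """
    prompt = build_prompt(caption, question)
    target = reference.strip().lower()
    hits = sum(reasoner(prompt).strip().lower() == target
               for _ in range(num_samples))
    return hits / num_samples
```

Because the reasoner is only ever called through a text prompt, swapping in a stronger reasoner means passing a different callable and leaves the captioner untouched. This mirrors the replaceability the abstract describes.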

Cite

Text

Gou et al. "Reasoning-Aligned Perception Decoupling for Scalable Multi-Modal Reasoning." International Conference on Learning Representations, 2026.

Markdown

[Gou et al. "Reasoning-Aligned Perception Decoupling for Scalable Multi-Modal Reasoning." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/gou2026iclr-reasoningaligned/)

BibTeX

@inproceedings{gou2026iclr-reasoningaligned,
  title     = {{Reasoning-Aligned Perception Decoupling for Scalable Multi-Modal Reasoning}},
  author    = {Gou, Yunhao and Chen, Kai and Liu, Zhili and Hong, Lanqing and Jin, Xin and Li, Zhenguo and Kwok, James and Zhang, Yu},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/gou2026iclr-reasoningaligned/}
}