Perception-Aware Policy Optimization for Multimodal Reasoning

Wang, Zhenhailong; Guo, Xuehang; Stoica, Sofia; Xu, Haiyang; Wang, Hongru; Ha, Hyeonjeong; Chen, Xiusi; Chen, Yangyi; Yan, Ming; Huang, Fei; Ji, Heng

ICLR 2026

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for empowering Large Language Models (LLMs) with long chain-of-thought reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error (67%) in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose PAPO, a novel policy gradient algorithm that encourages the model to generate visually grounded reasoning without external supervision. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term, which maximizes the difference between two probability distributions over the same rollout sequence, conditioned on either the original or corrupted visual input. Notably, PAPO does not rely on any additional data annotation, reward models, or stronger teacher models, and can therefore be seamlessly integrated into mainstream RLVR algorithms such as GRPO and DAPO. To further enhance the training stability of PAPO, we introduce the Double Entropy Loss, which effectively regularizes the new KL objective without compromising performance. Despite its simplicity, PAPO yields significant overall improvements of 4.4%-17.5% on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%-19.1%, on tasks with high vision dependency. We also observe a substantial reduction of 30.5% in perception errors, indicating improved perceptual capabilities with PAPO. Overall, PAPO offers a new perspective on advancing multimodal RLVR via the optimization objective, moving beyond rollout or reward design and pointing toward deeper integration of perception and reasoning.
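The KL term described in the abstract can be illustrated with a small numerical sketch. This is not the authors' implementation: it assumes toy per-token logits over a tiny vocabulary, and the function names (`log_softmax`, `implicit_perception_kl`) and array shapes are hypothetical. It only shows how a divergence between the distribution conditioned on the original image and the one conditioned on a corrupted image could be computed over the same rollout positions.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def implicit_perception_kl(logits_orig, logits_corrupt):
    """Mean per-token KL(p_orig || p_corrupt) for one rollout.

    Both inputs have shape (T, V): T rollout positions, V vocabulary entries.
    logits_orig    : logits when conditioning on the original image (assumed).
    logits_corrupt : logits for the same tokens with a corrupted image (assumed).
    """
    logp = log_softmax(logits_orig)
    logq = log_softmax(logits_corrupt)
    per_token = (np.exp(logp) * (logp - logq)).sum(axis=-1)
    return float(per_token.mean())


# Toy data standing in for model outputs (hypothetical values).
rng = np.random.default_rng(0)
logits_orig = rng.normal(size=(5, 8))
logits_corrupt = rng.normal(size=(5, 8))

# Identical conditioning gives zero divergence: the rollout ignores the image.
kl_same = implicit_perception_kl(logits_orig, logits_orig)
# Differing conditioning gives a positive value, which the objective rewards.
kl_diff = implicit_perception_kl(logits_orig, logits_corrupt)
```

In this sketch a larger value means the rollout's token probabilities depend more strongly on the visual input. How the term is weighted, estimated, and regularized inside GRPO or DAPO is not specified in the abstract and is left out here.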

Cite

Text

Wang et al. "Perception-Aware Policy Optimization for Multimodal Reasoning." International Conference on Learning Representations, 2026.

Markdown

[Wang et al. "Perception-Aware Policy Optimization for Multimodal Reasoning." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/wang2026iclr-perceptionaware/)

BibTeX

@inproceedings{wang2026iclr-perceptionaware,
  title     = {{Perception-Aware Policy Optimization for Multimodal Reasoning}},
  author    = {Wang, Zhenhailong and Guo, Xuehang and Stoica, Sofia and Xu, Haiyang and Wang, Hongru and Ha, Hyeonjeong and Chen, Xiusi and Chen, Yangyi and Yan, Ming and Huang, Fei and Ji, Heng},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/wang2026iclr-perceptionaware/}
}