R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning

Zhang, YiFan; Lu, Xingyu; Hu, Xiao; Fu, Chaoyou; Wen, Bin; Zhang, Tianke; Liu, Changyi; Jiang, Kaiyu; Chen, Kaibing; Tang, Kaiyu; Ding, Haojie; Chen, Jiankang; Yang, Fan; Zhang, Zhang; Gao, Tingting; Zhang, Di; Zhou, Guorui; Wang, Liang

ICLR 2026

Abstract

Multimodal Reward Models (MRMs) play a crucial role in enhancing the performance of Multimodal Large Language Models (MLLMs). While recent advancements have primarily focused on improving the model structure and training data of MRMs, there has been limited exploration of whether long-term reasoning capabilities benefit reward modeling and how to activate these capabilities in MRMs. In this paper, we explore how Reinforcement Learning (RL) can be used to improve reward modeling. Specifically, we reformulate the reward modeling problem as a rule-based RL task. However, we observe that directly applying existing RL algorithms, such as Reinforce++, to reward modeling often leads to training instability or even collapse due to the inherent limitations of these algorithms. To address this issue, we propose the StableReinforce algorithm, which refines the training loss, advantage estimation strategy, and reward design of existing RL methods. These refinements result in more stable training dynamics and superior performance. To facilitate MRM training, we collect 200K preference samples from diverse datasets. Our reward model, R1-Reward, trained using the StableReinforce algorithm on this dataset, significantly improves performance on multimodal reward modeling benchmarks. Compared to previous SOTA models, R1-Reward achieves an 8.4% improvement on the VL Reward-Bench and a 14.3% improvement on the Multimodal Reward Bench. Moreover, R1-Reward's performance improves further with more inference compute, highlighting the potential of RL algorithms in optimizing MRMs.

Cite

Text

Zhang et al. "R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning." International Conference on Learning Representations, 2026.

Markdown

[Zhang et al. "R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/zhang2026iclr-r1reward/)

BibTeX

@inproceedings{zhang2026iclr-r1reward,
  title     = {{R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning}},
  author    = {Zhang, YiFan and Lu, Xingyu and Hu, Xiao and Fu, Chaoyou and Wen, Bin and Zhang, Tianke and Liu, Changyi and Jiang, Kaiyu and Chen, Kaibing and Tang, Kaiyu and Ding, Haojie and Chen, Jiankang and Yang, Fan and Zhang, Zhang and Gao, Tingting and Zhang, Di and Zhou, Guorui and Wang, Liang},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/zhang2026iclr-r1reward/}
}