BaseReward: A Strong Baseline for Multimodal Reward Model

Zhang, YiFan; Yang, Haihua; Zhang, Huanyu; Shi, Yang; Chen, Zezhou; Tian, Haochen; Fu, Chaoyou; Wu, Kai; Cui, Bo; Wang, Xu; Pan, Jianfei; Wang, Haotian; Zhang, Zhang; Wang, Liang

ICLR 2026

Abstract

The rapid advancement of Multimodal Large Language Models (MLLMs) has made aligning them with human preferences a critical challenge. Reward Models (RMs) are a core technology for achieving this goal, but a systematic guide for building state-of-the-art Multimodal Reward Models (MRMs) is currently lacking in both academia and industry. Through exhaustive experimental analysis, this paper aims to provide a clear “recipe” for constructing high-performance MRMs. We systematically investigate every crucial component in the MRM development pipeline, including reward modeling paradigms (e.g., Naive-RM, Critic-based RM, and Generative RM), reward head architecture, training strategies, data curation (covering over ten multimodal and text-only preference datasets), backbone model and model scale, and ensemble methods. Based on these experimental insights, we introduce BaseReward, a powerful and efficient baseline for multimodal reward modeling. BaseReward adopts a simple yet effective architecture, built upon a Qwen2.5-VL backbone, featuring an optimized two-layer reward head, and is trained on a carefully curated mixture of high-quality multimodal and text-only preference data. Our results show that BaseReward establishes a new state-of-the-art (SOTA) on major benchmarks such as MM-RLHF-Reward Bench, VL-Reward Bench, and Multimodal Reward Bench, outperforming previous open-source and proprietary models. Furthermore, to validate its practical utility beyond static benchmarks, we integrate BaseReward into a real-world reinforcement learning pipeline, successfully enhancing an MLLM’s performance across various perception, reasoning, and conversational tasks. This work not only delivers a top-tier MRM but, more importantly, provides the community with a clear, empirically backed guide for developing robust reward models for the next generation of MLLMs.
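To make the architecture described in the abstract more concrete, the following is a minimal sketch of a two-layer reward head that maps a backbone feature vector to a scalar score, together with a pairwise preference loss. The abstract does not specify the activation function, layer sizes, initialization, or training objective, so the SiLU activation, the dimensions, and the Bradley-Terry style loss used here are assumptions for illustration only. The function names (`make_reward_head`, `reward`, `pairwise_loss`) are hypothetical, and the feature vector stands in for a pooled hidden state from a backbone such as Qwen2.5-VL.

```python
import math
import random


def make_reward_head(hidden_dim, inner_dim, seed=0):
    """Create weights for a two-layer MLP head (hypothetical layout)."""
    rng = random.Random(seed)
    s1 = 1.0 / math.sqrt(hidden_dim)
    s2 = 1.0 / math.sqrt(inner_dim)
    return {
        # First layer: hidden_dim -> inner_dim
        "w1": [[rng.uniform(-s1, s1) for _ in range(hidden_dim)]
               for _ in range(inner_dim)],
        "b1": [0.0] * inner_dim,
        # Second layer: inner_dim -> 1 (scalar reward)
        "w2": [rng.uniform(-s2, s2) for _ in range(inner_dim)],
        "b2": 0.0,
    }


def silu(x):
    """SiLU activation; the choice of activation is an assumption."""
    return x / (1.0 + math.exp(-x))


def reward(head, features):
    """Map a backbone feature vector to a scalar reward."""
    hidden = [
        silu(sum(w * f for w, f in zip(row, features)) + b)
        for row, b in zip(head["w1"], head["b1"])
    ]
    return sum(w * h for w, h in zip(head["w2"], hidden)) + head["b2"]


def pairwise_loss(r_chosen, r_rejected):
    """Bradley-Terry style loss: -log(sigmoid(r_chosen - r_rejected)).

    Assumed objective; written in a numerically stable form.
    """
    margin = r_chosen - r_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    head = make_reward_head(hidden_dim=8, inner_dim=4)
    chosen = reward(head, [0.2] * 8)     # stand-in for chosen-response features
    rejected = reward(head, [-0.1] * 8)  # stand-in for rejected-response features
    print(pairwise_loss(chosen, rejected))
```

In a real setup the head would sit on top of the backbone's final hidden state and be trained jointly or on frozen features; this sketch only shows the shape of the computation, not the paper's training recipe.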

Cite

Text

Zhang et al. "BaseReward: A Strong Baseline for Multimodal Reward Model." International Conference on Learning Representations, 2026.

Markdown

[Zhang et al. "BaseReward: A Strong Baseline for Multimodal Reward Model." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/zhang2026iclr-basereward/)

BibTeX

@inproceedings{zhang2026iclr-basereward,
  title     = {{BaseReward: A Strong Baseline for Multimodal Reward Model}},
  author    = {Zhang, YiFan and Yang, Haihua and Zhang, Huanyu and Shi, Yang and Chen, Zezhou and Tian, Haochen and Fu, Chaoyou and Wu, Kai and Cui, Bo and Wang, Xu and Pan, Jianfei and Wang, Haotian and Zhang, Zhang and Wang, Liang},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/zhang2026iclr-basereward/}
}