Generative RLHF-V: Learning Principles from Multi-Modal Human Preference

Abstract

Training multi-modal large language models (MLLMs) that align with human intentions is a long-standing challenge. Traditional score-only reward models for alignment suffer from low accuracy, weak generalization, and poor interpretability, which blocks the progress of alignment methods such as reinforcement learning from human feedback (RLHF). Generative reward models (GRMs) leverage MLLMs' intrinsic reasoning capabilities to discriminate between pair-wise responses, but their pair-wise paradigm makes it hard to generalize to learnable rewards. We introduce Generative RLHF-V, a novel alignment framework that integrates GRMs with multi-modal RLHF. We propose a two-stage pipeline: (1) multi-modal generative reward modeling from RL, where RL guides GRMs to actively capture human intention and then predict the correct pair-wise scores; and (2) RL optimization from grouped comparison, which enhances multi-modal RL scoring precision by comparing responses within a group. Experimental results demonstrate that, in addition to out-of-distribution generalization of RM discrimination, our framework improves the performance of 4 MLLMs across 7 benchmarks by 18.1%, whereas baseline RLHF improves it by only 5.3%. We further validate that Generative RLHF-V achieves a near-linear improvement with an increasing number of candidate responses.
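
The abstract does not spell out how grouped comparison turns pair-wise GRM judgments into a per-response reward. One plausible reading, offered here only as a minimal sketch and not as the authors' implementation, is to score every pair within a group of candidate responses and average each response's scores over its comparisons. The names `grouped_comparison_rewards`, `PairScorer`, and `toy_length_scorer` are hypothetical, and the toy scorer merely stands in for a real GRM call.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: (prompt, response_a, response_b) -> (score_a, score_b).
# In practice this would wrap a generative reward model; here it is an assumption.
PairScorer = Callable[[str, str, str], Tuple[float, float]]


def grouped_comparison_rewards(
    prompt: str, responses: Sequence[str], pair_scorer: PairScorer
) -> List[float]:
    """Average each response's pair-wise scores over all other group members."""
    n = len(responses)
    if n < 2:
        raise ValueError("grouped comparison needs at least two responses")
    totals = [0.0] * n
    # Compare every unordered pair once and accumulate both sides' scores.
    for i, j in combinations(range(n), 2):
        score_i, score_j = pair_scorer(prompt, responses[i], responses[j])
        totals[i] += score_i
        totals[j] += score_j
    # Each response takes part in exactly n - 1 comparisons.
    return [total / (n - 1) for total in totals]


def toy_length_scorer(prompt: str, a: str, b: str) -> Tuple[float, float]:
    """Stand-in scorer (NOT a real GRM): the longer response wins the pair."""
    return (1.0, 0.0) if len(a) > len(b) else (0.0, 1.0)


rewards = grouped_comparison_rewards("q", ["a", "bbb", "cc"], toy_length_scorer)
print(rewards)
```

Under this reading, a group of n candidates requires n(n-1)/2 pair-wise comparisons, so the cost of scoring grows quadratically with group size; whether the paper uses all pairs or a subset is not stated in the abstract.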

Cite

Text

Zhou et al. "Generative RLHF-V: Learning Principles from Multi-Modal Human Preference." Advances in Neural Information Processing Systems, 2025.

Markdown

[Zhou et al. "Generative RLHF-V: Learning Principles from Multi-Modal Human Preference." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/zhou2025neurips-generative/)

BibTeX

@inproceedings{zhou2025neurips-generative,
  title     = {{Generative RLHF-V: Learning Principles from Multi-Modal Human Preference}},
  author    = {Zhou, Jiayi and Ji, Jiaming and Chen, Boyuan and Sun, Jiapeng and Chen, Wenqi and Hong, Donghai and Han, Sirui and Guo, Yike and Yang, Yaodong},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/zhou2025neurips-generative/}
}