Why Is Your Language Model a Poor Implicit Reward Model?

Razin, Noam; Lin, Yong; Yao, Jiarui; Arora, Sanjeev

ICLR 2026

Abstract

Reward models are key to language model post-training and inference pipelines. Conveniently, recent work showed that every language model defines an implicit reward model (IM-RM), without requiring any architectural changes. However, such IM-RMs tend to generalize worse, especially out-of-distribution, compared to explicit reward models (EX-RMs) that apply a dedicated linear head over the hidden representations of a language model. The existence of a generalization gap is puzzling, as EX-RMs and IM-RMs are nearly identical. They can be trained using the same data, loss function, and language model, and differ only in how the reward is computed. Toward a fundamental understanding of the implicit biases underlying different reward model types, we investigate the root cause of this gap. Our main finding, backed by theory and experiments, is that IM-RMs rely more heavily on superficial token-level cues. Consequently, they often generalize worse than EX-RMs under token-level distribution shifts, as well as in-distribution. Furthermore, we provide evidence against alternative hypotheses for the generalization gap. Most notably, we challenge the claim that IM-RMs struggle in tasks where generation is harder than verification because they can operate both as a verifier and a generator. Overall, our results highlight that seemingly minor design choices can substantially impact the generalization behavior of reward models.
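To make the phrase "differ only in how the reward is computed" concrete, the following is a minimal sketch of the two reward computations. The abstract does not give formulas, so everything here is an assumption for illustration: the EX-RM is taken to be a dot product between a final hidden representation and a linear head, the IM-RM is taken to be the DPO-style implicit reward (a coefficient `beta` times the log-probability ratio between the model and a reference model), and both are scored with a Bradley-Terry preference loss. The function names and toy numbers are hypothetical and do not come from the paper.

```python
import math
from typing import Sequence


def ex_rm_reward(hidden: Sequence[float], head: Sequence[float]) -> float:
    """Explicit RM (assumed form): linear head applied to a hidden representation."""
    return sum(h * w for h, w in zip(hidden, head))


def im_rm_reward(
    logp_model: Sequence[float],
    logp_ref: Sequence[float],
    beta: float = 0.1,
) -> float:
    """Implicit RM (assumed DPO-style form).

    Inputs are per-token log-probabilities of the same response under the
    trained model and under a reference model. The reward is
    beta * (log pi(y|x) - log pi_ref(y|x)), so it depends directly on
    token-level probabilities rather than on a learned head.
    """
    return beta * (sum(logp_model) - sum(logp_ref))


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-sigmoid of the reward margin; usable with either reward type."""
    margin = r_chosen - r_rejected
    return math.log1p(math.exp(-margin))


# Toy values, purely illustrative.
ex_reward = ex_rm_reward(hidden=[1.0, 2.0], head=[0.5, -0.25])
im_reward = im_rm_reward(logp_model=[-1.0, -2.0], logp_ref=[-1.5, -2.5], beta=0.1)
tie_loss = bradley_terry_loss(0.0, 0.0)
```

Under these assumptions the same loss applies to both models; only the reward function changes, which is the design difference the abstract identifies as the source of the generalization gap.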

Cite

Text

Razin et al. "Why Is Your Language Model a Poor Implicit Reward Model?" International Conference on Learning Representations, 2026.

Markdown

[Razin et al. "Why Is Your Language Model a Poor Implicit Reward Model?" International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/razin2026iclr-your/)

BibTeX

@inproceedings{razin2026iclr-your,
  title     = {{Why Is Your Language Model a Poor Implicit Reward Model?}},
  author    = {Razin, Noam and Lin, Yong and Yao, Jiarui and Arora, Sanjeev},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/razin2026iclr-your/}
}