Beyond Verifiable Rewards: Scaling Reinforcement Learning in Language Models to Unverifiable Data

Abstract

We propose to scale RL to unverifiable data with a novel algorithm, JEPO (Jensen's Evidence lower bound for Policy Optimization). While most prior efforts on scaling RL for LLMs focus on verifiable data, where ground-truth answers are typically short-form and can be matched easily, we investigate the case where such assumptions are less valid (e.g., when answers are long-form, such as mathematical proofs). To scale RL training to unverifiable data under contemporary training constraints, JEPO applies Jensen's evidence lower bound, a pragmatic simplification of the evidence lower bound that views the chain-of-thought as a latent variable in the generative process. We show that on verifiable datasets (math), JEPO is as effective as RL with verifiable reward; on semi-verifiable and unverifiable datasets (numina and numina-proof), JEPO improves over RL with verifiable reward, which can only leverage a subset of the data source, on both soft-match-based evaluations and test-set likelihood evaluations.
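A minimal sketch of the bound the name suggests, under our own notation (the paper's exact objective may differ): treat the chain-of-thought $z$ as a latent variable sampled from the policy $\pi_\theta$ before the answer $y$, and apply Jensen's inequality to the log-evidence of the answer given a prompt $x$:

```latex
\begin{align}
\log \pi_\theta(y \mid x)
  &= \log \, \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}
       \left[ \pi_\theta(y \mid x, z) \right] \\
  &\ge \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}
       \left[ \log \pi_\theta(y \mid x, z) \right]
  \qquad \text{(Jensen's inequality, concavity of } \log\text{)}
\end{align}
```

Using the policy itself as the proposal over chains-of-thought (rather than a separate variational posterior) is what makes this a "pragmatic simplification" of the full ELBO: maximizing the right-hand side turns a long-form, unverifiable target $y$ into a per-token likelihood objective that standard policy-gradient machinery can optimize.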

Cite

Text

Tang et al. "Beyond Verifiable Rewards: Scaling Reinforcement Learning in Language Models to Unverifiable Data." Advances in Neural Information Processing Systems, 2025.

Markdown

[Tang et al. "Beyond Verifiable Rewards: Scaling Reinforcement Learning in Language Models to Unverifiable Data." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/tang2025neurips-beyond/)

BibTeX

@inproceedings{tang2025neurips-beyond,
  title     = {{Beyond Verifiable Rewards: Scaling Reinforcement Learning in Language Models to Unverifiable Data}},
  author    = {Tang, Yunhao and Wang, Sid and Madaan, Lovish and Munos, Remi},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/tang2025neurips-beyond/}
}