Reinforcing General Reasoning Without Verifiers
Abstract
The recent paradigm shift towards training large language models (LLMs) using DeepSeek-R1-Zero-style reinforcement learning (RL) on verifiable rewards has led to impressive advancements in code and mathematical reasoning. However, this methodology is limited to tasks where rule-based answer verification is possible and does not naturally extend to real-world domains such as chemistry, healthcare, engineering, law, biology, business, and economics. Current practical workarounds use an additional LLM as a model-based verifier; however, this introduces issues such as reliance on a strong verifier LLM, susceptibility to reward hacking, and the practical burden of maintaining the verifier model in memory during training. To address this and extend DeepSeek-R1-Zero-style training to general reasoning domains, we propose a verifier-free method (**VeriFree**) that bypasses answer verification and instead directly maximizes the probability of generating the reference answer, derived in a principled way from the RL objective. We compare VeriFree with verifier-based methods and demonstrate that, in addition to its significant practical benefits and reduced compute requirements, VeriFree matches and even surpasses verifier-based methods on extensive evaluations across MMLU-Pro, GPQA, SuperGPQA, and math-related benchmarks.
Cite
Text
Zhou et al. "Reinforcing General Reasoning Without Verifiers." International Conference on Learning Representations, 2026.Markdown
[Zhou et al. "Reinforcing General Reasoning Without Verifiers." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/zhou2026iclr-reinforcing/)BibTeX
@inproceedings{zhou2026iclr-reinforcing,
title = {{Reinforcing General Reasoning Without Verifiers}},
author = {Zhou, Xiangxin and Liu, Zichen and Sims, Anya and Wang, Haonan and Pang, Tianyu and Li, Chongxuan and Wang, Liang and Lin, Min and Du, Chao},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/zhou2026iclr-reinforcing/}
}