Native Reasoning Models: Training Language Models to Reason on Unverifiable Data

Wang, Yuanfu; Liu, Zhixuan; Xiangtian, Li; Lu, Chaochao; Yang, Chao

ICLR 2026

Abstract

The dominant paradigm for training large reasoning models, combining Supervised Fine-Tuning (SFT) with Reinforcement Learning with Verifiable Rewards (RLVR), is fundamentally constrained by its reliance on high-quality, human-annotated reasoning data and external verifiers. This dependency incurs significant data-collection costs, risks embedding human cognitive biases, and confines the reinforcement learning stage to objectively assessable domains like mathematics and coding, leaving a vast landscape of unverifiable tasks unaddressed. To overcome these limitations, we introduce Native Reasoning Training (NRT), a novel framework that cultivates complex reasoning by having the model generate its own reasoning traces using only standard question-answer pairs, thereby obviating the need for expert-written demonstrations. NRT reframes the training problem by treating the reasoning process as a latent variable. It employs a unified training objective that models reasoning as an optimization problem, intrinsically rewarding paths that increase the model's likelihood of producing the ground-truth answer. This unified perspective allows us to analyze intrinsic failure modes of prior methods, such as policy collapse, and systematically design more robust reward aggregation functions, creating a self-correcting feedback loop where the model learns to "think" in ways that resolve its own uncertainty. Empirical evaluation on Llama and Mistral model families demonstrates that NRT achieves state-of-the-art performance among verifier-free methods, significantly outperforming standard SFT baselines and prior verifier-free RL methods. Our approach yields particularly strong performance gains in complex reasoning domains and exhibits high robustness to policy collapse, offering a general, scalable path toward building more powerful and broadly applicable reasoning systems.
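As a rough illustration of the idea of "rewarding paths that increase the model's likelihood of producing the ground-truth answer", the sketch below scores a reasoning trace by how much it raises the log-likelihood of the reference answer. The abstract does not give the actual objective or the reward aggregation functions, so everything here is an assumption: the function names `answer_log_likelihood` and `likelihood_gain_reward` are hypothetical, the per-token log-probabilities are assumed to come from the model being trained, and summing token log-probabilities is only one possible aggregation.

```python
from typing import Sequence


def answer_log_likelihood(token_logprobs: Sequence[float]) -> float:
    """Aggregate per-token log-probabilities of the ground-truth answer.

    Assumption: a plain sum is used here; the paper studies other
    aggregation functions that the abstract does not specify.
    """
    return sum(token_logprobs)


def likelihood_gain_reward(
    logprobs_with_trace: Sequence[float],
    logprobs_without_trace: Sequence[float],
) -> float:
    """Hypothetical intrinsic reward for one sampled reasoning trace.

    Positive when conditioning on the trace makes the reference answer
    more likely than answering directly, negative when it makes it
    less likely. No external verifier is involved.
    """
    with_trace = answer_log_likelihood(logprobs_with_trace)
    without_trace = answer_log_likelihood(logprobs_without_trace)
    return with_trace - without_trace


if __name__ == "__main__":
    # Toy numbers standing in for model outputs on a two-token answer.
    helpful = likelihood_gain_reward([-0.1, -0.2], [-1.0, -1.5])
    harmful = likelihood_gain_reward([-2.0, -2.0], [-1.0, -1.5])
    print(helpful > 0, harmful < 0)
```

A reward of this shape needs only question-answer pairs, which matches the verifier-free setting the abstract describes; how such per-trace scores are aggregated and stabilized against policy collapse is the part the paper itself addresses.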

Cite

Text

Wang et al. "Native Reasoning Models: Training Language Models to Reason on Unverifiable Data." International Conference on Learning Representations, 2026.

Markdown

[Wang et al. "Native Reasoning Models: Training Language Models to Reason on Unverifiable Data." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/wang2026iclr-native/)

BibTeX

@inproceedings{wang2026iclr-native,
  title     = {{Native Reasoning Models: Training Language Models to Reason on Unverifiable Data}},
  author    = {Wang, Yuanfu and Liu, Zhixuan and Xiangtian, Li and Lu, Chaochao and Yang, Chao},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/wang2026iclr-native/}
}