SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning

Fu, Yuqian; Chen, Tinghong; Chai, Jiajun; Wang, Xihuai; Tu, Songjun; Yin, Guojun; Lin, Wei; Zhang, Qichao; Zhu, Yuanheng; Zhao, Dongbin

ICLR 2026

Abstract

Large language models (LLMs) have achieved remarkable progress in reasoning tasks, yet optimally integrating Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) remains a fundamental challenge. Through a comprehensive analysis of token distributions, learning dynamics, and integration mechanisms from an entropy-based perspective, we reveal key differences between these paradigms: SFT induces coarse-grained, global shifts to policy distributions, while RL performs fine-grained, selective optimizations. Our analysis further establishes entropy as a critical indicator of training efficacy. Building on these observations, we introduce **S**upervised **R**einforcement **F**ine-**T**uning (**SRFT**), a single-stage framework that unifies both fine-tuning paradigms through entropy-aware weighting mechanisms. SRFT simultaneously applies SFT and RL to directly optimize LLMs using demonstrations and self-exploration rollouts rather than through two-stage sequential methods. Extensive experiments show that SRFT outperforms zero-RL baselines by **9.0%** on five mathematical reasoning benchmarks and by **10.9%** on three out-of-distribution benchmarks. Moreover, by leveraging demonstration data, SRFT maintains a more stable policy entropy, facilitating sustained policy improvement.
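
The abstract describes a single objective that applies SFT on demonstrations and RL on self-exploration rollouts at the same time, with entropy-aware weighting. It does not give the weighting formula, so the following is only a minimal sketch under stated assumptions: the weight `exp(-H)` on the SFT term, the REINFORCE-style RL surrogate, and all function names are hypothetical choices for illustration, not the authors' method.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of one token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_policy_entropy(logits_per_token):
    """Average token-level entropy over a sequence."""
    ents = [entropy(softmax(l)) for l in logits_per_token]
    return sum(ents) / len(ents)

def sft_loss(logits_per_token, target_ids):
    """Mean negative log-likelihood of the demonstration tokens."""
    nll = [-math.log(softmax(l)[t]) for l, t in zip(logits_per_token, target_ids)]
    return sum(nll) / len(nll)

def rl_loss(logits_per_token, sampled_ids, advantage):
    """REINFORCE-style surrogate: -advantage * mean log-probability.

    Assumption: a plain policy-gradient surrogate stands in for
    whatever RL objective the paper actually uses.
    """
    logp = [math.log(softmax(l)[t]) for l, t in zip(logits_per_token, sampled_ids)]
    return -advantage * sum(logp) / len(logp)

def single_stage_loss(demo_logits, demo_ids, roll_logits, roll_ids, advantage):
    """Combine both terms in one step instead of two sequential stages.

    Assumption: the SFT term is down-weighted by exp(-H) when the policy
    is uncertain on the demonstration. In an autodiff framework this
    weight would be treated as a constant (stop-gradient).
    """
    h = mean_policy_entropy(demo_logits)
    w_sft = math.exp(-h)  # hypothetical entropy-aware weight
    return w_sft * sft_loss(demo_logits, demo_ids) + rl_loss(roll_logits, roll_ids, advantage)
```

With a uniform distribution over four tokens, the entropy is `log 4`, so the hypothetical SFT weight is `0.25`; a sharply peaked distribution gives a weight close to 1.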

Cite

Text

Fu et al. "SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning." International Conference on Learning Representations, 2026.

Markdown

[Fu et al. "SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/fu2026iclr-srft/)

BibTeX

@inproceedings{fu2026iclr-srft,
  title     = {{SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning}},
  author    = {Fu, Yuqian and Chen, Tinghong and Chai, Jiajun and Wang, Xihuai and Tu, Songjun and Yin, Guojun and Lin, Wei and Zhang, Qichao and Zhu, Yuanheng and Zhao, Dongbin},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/fu2026iclr-srft/}
}