On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification

Abstract

In this work, we present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of the model compared to RL. To rectify this, we propose Dynamic Fine-Tuning (DFT), which stabilizes gradient updates for each token by dynamically rescaling the objective function with the probability of that token. With just a single-line change, the method outperforms standard SFT on multiple difficult benchmarks and base models, from math reasoning to code generation and multi-modal tasks, demonstrating improved generalization. Additionally, DFT achieves competitive results in offline RL settings, providing an effective yet streamlined alternative. By bridging theoretical insights with practical solutions, this work advances the state of SFT. The source code will be available at https://github.com/yongliang-wu/DFT.
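The rescaling described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the per-token objective is the usual negative log-likelihood multiplied by the token's own probability, and that this weight is treated as a constant when differentiating (a stop-gradient, e.g. a detached tensor in an autodiff framework). The function names `softmax`, `token_losses`, and `target_logit_grads` are hypothetical and chosen only for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_losses(logits, target):
    """Return (p, sft_loss, dft_loss) for a single target token.

    sft_loss is the standard negative log-likelihood.
    dft_loss rescales it by the target probability p, which is
    assumed here to act as a constant weight (stop-gradient).
    """
    p = softmax(logits)[target]
    sft_loss = -math.log(p)
    dft_loss = -p * math.log(p)
    return p, sft_loss, dft_loss

def target_logit_grads(logits, target):
    """Analytic gradient of each loss w.r.t. the target token's logit.

    d(-log p)/dz_target = p - 1 for softmax cross-entropy.
    With p held constant as a weight, the rescaled gradient is p * (p - 1).
    """
    p = softmax(logits)[target]
    sft_grad = p - 1.0
    dft_grad = p * (p - 1.0)
    return sft_grad, dft_grad

# A target token the model currently finds unlikely.
logits = [0.0, 5.0]
print(token_losses(logits, target=0))
print(target_logit_grads(logits, target=0))
```

Under these assumptions, a low-probability target token receives a gradient close to -1 under the standard loss but a much smaller one after rescaling, which is one way to read the abstract's claim that the rescaling stabilizes per-token updates.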

Cite

Text

Wu et al. "On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification." International Conference on Learning Representations, 2026.

Markdown

[Wu et al. "On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/wu2026iclr-generalization/)

BibTeX

@inproceedings{wu2026iclr-generalization,
  title     = {{On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification}},
  author    = {Wu, Yongliang and Zhou, Yizhou and Ziheng, Zhou and Peng, Yingzhe and Ye, Xinyu and Hu, Xinting and Zhu, Wenbo and Qi, Lu and Yang, Ming-Hsuan and Yang, Xu},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/wu2026iclr-generalization/}
}