Stepwise Guided Policy Optimization: Coloring Your Incorrect Reasoning in GRPO
Abstract
Reinforcement learning (RL) has proven effective in strengthening the reasoning capabilities of large language models (LLMs). A widely adopted method, Group Relative Policy Optimization (GRPO)~\citep{Shao-2024-Deepseekmath}, has shown strong empirical results in training recent reasoning models~\citep{Guo-2025-Deepseek}, but it fails to update the policy when all responses within a group are incorrect (i.e., all-negative-sample groups). This limitation highlights a gap between artificial and human intelligence: unlike humans, who can learn from mistakes, GRPO discards these failure signals. We introduce a simple framework to mitigate the all-negative-sample issue by incorporating response diversity within groups using a \textit{step-wise} judge model, which can be trained directly or adapted from existing LLMs. In a simplified setting, we prove that this diversification accelerates GRPO’s learning dynamics. We then empirically validate Stepwise Guided Policy Optimization (SGPO) across model sizes (7B, 14B, 32B) in both offline and online training on nine reasoning benchmarks (including base and distilled variants). Overall, SGPO improves average performance and is effective in early and mid-training when all-negative groups are prevalent, while improvements are not uniform across every benchmark and depend on the structure and informativeness of negative samples. Finally, SGPO does not require the judge model to generate correct solutions, distinguishing it from knowledge distillation methods.
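The all-negative-sample issue can be illustrated with a small numerical sketch. It assumes the commonly used group-normalized advantage (reward minus the group mean, divided by the group standard deviation) and a hypothetical partial-credit rule, `stepwise_reward`, that scores a response by the fraction of steps a judge accepts before the first error. That rule, the function names, and the example step labels are illustrative assumptions, not the paper's actual reward design.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def stepwise_reward(step_labels):
    """Hypothetical partial credit: share of steps judged correct
    before the first incorrect step (an assumption for illustration)."""
    if not step_labels:
        return 0.0
    correct_prefix = 0
    for is_correct in step_labels:
        if not is_correct:
            break
        correct_prefix += 1
    return correct_prefix / len(step_labels)


# An all-negative group under a binary outcome reward:
# every response is wrong, so every reward is 0.
binary_rewards = [0.0, 0.0, 0.0, 0.0]
binary_adv = group_relative_advantages(binary_rewards)

# The same four incorrect responses, with made-up per-step judge labels.
judged_steps = [
    [True, True, False, False],
    [False, False, False, False],
    [True, False, False, False],
    [True, True, True, False],
]
stepwise_rewards = [stepwise_reward(steps) for steps in judged_steps]
stepwise_adv = group_relative_advantages(stepwise_rewards)

print("binary advantages:  ", binary_adv)
print("stepwise rewards:   ", stepwise_rewards)
print("stepwise advantages:", stepwise_adv)
```

With identical rewards, every advantage is zero and the group contributes no policy gradient. Once the incorrect responses receive different scores, the advantages separate, so responses that stay correct for longer are favored over those that fail immediately.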
Cite
Text
Chen et al. "Stepwise Guided Policy Optimization: Coloring Your Incorrect Reasoning in GRPO." Transactions on Machine Learning Research, 2026.
Markdown
[Chen et al. "Stepwise Guided Policy Optimization: Coloring Your Incorrect Reasoning in GRPO." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/chen2026tmlr-stepwise/)
BibTeX
@article{chen2026tmlr-stepwise,
title = {{Stepwise Guided Policy Optimization: Coloring Your Incorrect Reasoning in GRPO}},
author = {Chen, Peter and Li, Xiaopeng and Li, Ziniu and Chen, Xi and Lin, Tianyi},
journal = {Transactions on Machine Learning Research},
year = {2026},
url = {https://mlanthology.org/tmlr/2026/chen2026tmlr-stepwise/}
}