Advantage Shaping as Surrogate Reward Maximization: Unifying Pass@K Policy Gradients
Abstract
We unify two seemingly distinct approaches to policy gradient optimization for the Pass@K objective in reinforcement learning with verifiable rewards (RLVR): direct REINFORCE-style methods and advantage-shaping techniques that modify GRPO. By reverse-engineering existing advantage-shaping algorithms, we reveal that they implicitly optimize surrogate rewards. We specifically interpret practical "hard-example upweighting" modifications to GRPO as reward-level regularization. Conversely, starting from surrogate reward objectives, we provide a simple recipe for deriving both existing and new advantage-shaping methods. This perspective provides a lens for RLVR beyond our original motivation of Pass@K.
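For readers unfamiliar with the two quantities the abstract names, the following is a minimal background sketch, not the paper's method. It assumes binary verifiable rewards and shows (a) the widely used combinatorial Pass@K estimate from n samples with c correct, and (b) a group-normalized advantage in the style commonly associated with GRPO. The function names `pass_at_k` and `grpo_advantages` and the `eps` stabilizer are illustrative assumptions, not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@K from n sampled responses, c of which are correct.

    Uses 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no correct response.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is then 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-normalized advantages: (r - mean) / (std + eps).

    Assumption: this is the common mean/std normalization over one
    prompt's group of responses; eps is a hypothetical stabilizer.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # One correct response out of four samples.
    print(pass_at_k(n=4, c=1, k=1))
    print(pass_at_k(n=4, c=1, k=4))
    print(grpo_advantages([1.0, 0.0, 0.0, 0.0]))
```

Advantage-shaping methods, as described in the abstract, modify the second quantity; the sketch only fixes the unmodified baseline under the stated assumptions.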
Cite
Text
Thrampoulidis et al. "Advantage Shaping as Surrogate Reward Maximization: Unifying Pass@K Policy Gradients." Transactions on Machine Learning Research, 2026.
Markdown
[Thrampoulidis et al. "Advantage Shaping as Surrogate Reward Maximization: Unifying Pass@K Policy Gradients." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/thrampoulidis2026tmlr-advantage/)
BibTeX
@article{thrampoulidis2026tmlr-advantage,
title = {{Advantage Shaping as Surrogate Reward Maximization: Unifying Pass@K Policy Gradients}},
author = {Thrampoulidis, Christos and Mahdavi, Sadegh and Deng, Wenlong},
journal = {Transactions on Machine Learning Research},
year = {2026},
url = {https://mlanthology.org/tmlr/2026/thrampoulidis2026tmlr-advantage/}
}