Advantage Shaping as Surrogate Reward Maximization: Unifying Pass@K Policy Gradients

Abstract

We unify two seemingly distinct approaches to policy gradient optimization for the Pass@K objective in reinforcement learning with verifiable rewards (RLVR): direct REINFORCE-style methods and advantage-shaping techniques that modify GRPO. By reverse-engineering existing advantage-shaping algorithms, we reveal that they implicitly optimize surrogate rewards. We specifically interpret practical "hard-example upweighting" modifications to GRPO as reward-level regularization. Conversely, starting from surrogate reward objectives, we provide a simple recipe for deriving both existing and new advantage-shaping methods. This perspective provides a lens for RLVR beyond our original motivation of Pass@K.
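
For readers unfamiliar with the two quantities the abstract refers to, the following is a minimal illustrative sketch and is not taken from the paper. It assumes binary verifiable rewards, uses the commonly cited combinatorial Pass@K estimator (n samples, c of them correct), and shows the plain group-normalized GRPO advantage that advantage-shaping methods modify. The function names `pass_at_k` and `grpo_advantages`, the use of population standard deviation, and the `eps` constant are assumptions made for illustration.

```python
from math import comb
from statistics import mean, pstdev


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@K from n sampled responses, c of which are correct.

    Equals one minus the probability that a uniformly random size-k
    subset of the n samples contains no correct response.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when n - c < k, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-normalized advantages: (reward - group mean) / (group std + eps).

    Assumption: population standard deviation; implementations differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # One prompt, four sampled responses, only the first one verified correct.
    rewards = [1.0, 0.0, 0.0, 0.0]
    print(pass_at_k(n=4, c=1, k=2))
    print(grpo_advantages(rewards))
```

In this sketch a hard prompt (few correct samples) gives the single correct response a large positive advantage, which is the kind of per-sample weight that advantage-shaping variants rescale.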

Cite

Text

Thrampoulidis et al. "Advantage Shaping as Surrogate Reward Maximization: Unifying Pass@K Policy Gradients." Transactions on Machine Learning Research, 2026.

Markdown

[Thrampoulidis et al. "Advantage Shaping as Surrogate Reward Maximization: Unifying Pass@K Policy Gradients." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/thrampoulidis2026tmlr-advantage/)

BibTeX

@article{thrampoulidis2026tmlr-advantage,
  title     = {{Advantage Shaping as Surrogate Reward Maximization: Unifying Pass@K Policy Gradients}},
  author    = {Thrampoulidis, Christos and Mahdavi, Sadegh and Deng, Wenlong},
  journal   = {Transactions on Machine Learning Research},
  year      = {2026},
  url       = {https://mlanthology.org/tmlr/2026/thrampoulidis2026tmlr-advantage/}
}