Asymmetric REINFORCE for Off-Policy Reinforcement Learning: Balancing Positive and Negative Rewards

Abstract

Reinforcement learning (RL) is increasingly used to align large language models (LLMs). Off-policy methods offer greater implementation simplicity and data efficiency than on-policy techniques, but often result in suboptimal performance. In this work, we study the intermediate range of algorithms between off-policy RL and supervised fine-tuning by analyzing a simple off-policy REINFORCE algorithm, where the advantage is defined as $A=r-V$, with $r$ a reward and $V$ some tunable baseline. Intuitively, lowering $V$ emphasizes high-reward samples, while raising it penalizes low-reward ones more heavily. We first provide a theoretical analysis of this off-policy REINFORCE algorithm, showing that when the baseline $V$ lower-bounds the expected reward, the algorithm enjoys a policy improvement guarantee. Our analysis reveals that while on-policy updates can safely leverage both positive and negative signals, off-policy updates benefit from focusing more on positive rewards than on negative ones. We validate our findings experimentally in a controlled stochastic bandit setting and through fine-tuning state-of-the-art LLMs on reasoning tasks.
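Below is a minimal sketch (not the authors' implementation) of the off-policy REINFORCE update described in the abstract, on an assumed toy setup: a K-armed Bernoulli bandit, a softmax policy, a fixed uniform behavior policy, and illustrative values for the learning rate and the tunable baseline $V$. The update uses $\nabla_\theta \log \pi(a)\,(r - V)$ on actions sampled off-policy, so the choice of $V$ controls how much weight falls on high- versus low-reward samples.

```python
# Minimal sketch (assumed toy setting, not the authors' code):
# off-policy REINFORCE on a stochastic K-armed bandit with a softmax
# policy and a tunable baseline V. Actions come from a fixed behavior
# distribution mu, while the update uses grad log pi(a) * (r - V)
# without importance weighting.
import numpy as np

rng = np.random.default_rng(0)
K = 5                                    # number of arms (assumption)
true_means = rng.uniform(0.0, 1.0, K)    # Bernoulli reward probabilities

theta = np.zeros(K)                      # policy logits
V = 0.2                                  # tunable baseline: lowering V emphasizes
                                         # high-reward samples, raising it penalizes
                                         # low-reward ones more heavily
lr = 0.1                                 # learning rate (assumption)
mu = np.full(K, 1.0 / K)                 # fixed off-policy behavior distribution

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for step in range(5000):
    a = rng.choice(K, p=mu)                      # action sampled off-policy from mu
    r = float(rng.random() < true_means[a])      # stochastic Bernoulli reward

    pi = softmax(theta)
    # grad_theta log pi(a) for a softmax policy is one_hot(a) - pi
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0

    theta += lr * (r - V) * grad_log_pi          # asymmetric REINFORCE update

print("learned policy:", softmax(theta).round(3))
print("best arm:", int(true_means.argmax()))
```

With $V$ kept below the expected reward, the update tends to reinforce above-baseline arms even though the data is generated off-policy, which is the regime the paper's policy improvement guarantee concerns; the exact conditions are given in the paper itself.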

Cite

Text

Arnal et al. "Asymmetric REINFORCE for Off-Policy Reinforcement Learning: Balancing Positive and Negative Rewards." Advances in Neural Information Processing Systems, 2025.

Markdown

[Arnal et al. "Asymmetric REINFORCE for Off-Policy Reinforcement Learning: Balancing Positive and Negative Rewards." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/arnal2025neurips-asymmetric/)

BibTeX

@inproceedings{arnal2025neurips-asymmetric,
  title     = {{Asymmetric REINFORCE for Off-Policy Reinforcement Learning: Balancing Positive and Negative Rewards}},
  author    = {Arnal, Charles and Narozniak, Gaëtan and Cabannes, Vivien and Tang, Yunhao and Kempe, Julia and Munos, Remi},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/arnal2025neurips-asymmetric/}
}