OPPO: Accelerating PPO-Based RLHF via Pipeline Overlap

Yan, Kaizhuo; Yu, YingJie; Yu, Yifan; Zheng, Haizhong; Lai, Fan

ICLR 2026

Abstract

Proximal Policy Optimization (PPO)-based reinforcement learning from human feedback (RLHF) is a widely adopted paradigm for aligning large language models (LLMs) with human preferences. However, its training pipeline suffers from substantial inefficiencies caused by sequential multi-model dependencies (e.g., the reward model depends on the actor's outputs) and long-tail response lengths, where a few long responses delay completion of the whole stage. We present OPPO, a novel, lightweight, and model-agnostic PPO-based RLHF framework that improves training efficiency by overlapping pipeline execution. OPPO introduces two techniques: (1) intra-step overlap, which streams upstream model outputs (e.g., from the actor model) in right-sized chunks, enabling the downstream model (e.g., the reward model) to begin prefill while the upstream model continues decoding; and (2) inter-step overlap, which adaptively overcommits a few prompts and defers long generations to future steps, mitigating tail latency without discarding partial work. OPPO integrates with existing PPO implementations through a lightweight wrapper. Extensive evaluations show that OPPO accelerates PPO-based RLHF training by $1.8\times$--$2.8\times$ and improves GPU utilization by $1.4\times$--$2.1\times$ without compromising training convergence.
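
The two overlap ideas in the abstract can be illustrated with a toy timing model. The sketch below is not the authors' implementation: the function names, the per-token costs, the fixed (non-adaptive) overcommit, and the lock-step decoding assumption are all hypothetical and chosen only to make the scheduling effect visible.

```python
# Toy timing model for the two overlap ideas described in the abstract.
# All names and cost parameters are illustrative assumptions, not OPPO's API.

def sequential_time(num_tokens, decode_ms, prefill_ms):
    """Downstream prefill starts only after the upstream finishes decoding."""
    return num_tokens * decode_ms + num_tokens * prefill_ms


def overlapped_time(num_tokens, chunk_size, decode_ms, prefill_ms):
    """Intra-step overlap: the upstream emits chunks, and the downstream
    prefills chunk k as soon as it is decoded and chunk k-1 is prefilled."""
    decode_done = 0   # time at which the latest chunk finished decoding
    prefill_done = 0  # time at which the latest chunk finished prefill
    for start in range(0, num_tokens, chunk_size):
        n = min(chunk_size, num_tokens - start)
        decode_done += n * decode_ms
        prefill_done = max(prefill_done, decode_done) + n * prefill_ms
    return prefill_done


def run_step(remaining_tokens, batch_size):
    """Inter-step overlap: the step ends once `batch_size` generations finish.

    `remaining_tokens` holds the tokens still to decode for every in-flight
    generation (new prompts plus any deferred from earlier steps). Assumes
    lock-step decoding, one token per tick per generation.
    Returns (ticks_used, remaining tokens of the deferred generations).
    """
    ordered = sorted(remaining_tokens)
    ticks = ordered[batch_size - 1]  # when the batch_size-th shortest finishes
    # Deferred generations keep their progress instead of being restarted.
    deferred = [r - ticks for r in ordered[batch_size:]]
    return ticks, deferred


if __name__ == "__main__":
    # Intra-step: 1000 tokens, 20 ms/token decode, 2 ms/token prefill.
    print(sequential_time(1000, 20, 2))
    print(overlapped_time(1000, 100, 20, 2))

    # Inter-step: batch of 4 with one 200-token straggler.
    print(run_step([10, 12, 11, 200], batch_size=4))         # no overcommit
    print(run_step([10, 12, 11, 200, 9, 10], batch_size=4))  # overcommit by 2
```

In this toy setting, chunked streaming hides all but the last chunk's prefill behind decoding, and overcommitting by two prompts lets the step finish without waiting for the 200-token response, which is carried into the next step with its progress retained. The speedups reported in the abstract come from the authors' evaluations, not from this model.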

Cite

Text

Yan et al. "OPPO: Accelerating PPO-Based RLHF via Pipeline Overlap." International Conference on Learning Representations, 2026.

Markdown

[Yan et al. "OPPO: Accelerating PPO-Based RLHF via Pipeline Overlap." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/yan2026iclr-oppo/)

BibTeX

@inproceedings{yan2026iclr-oppo,
  title     = {{OPPO: Accelerating PPO-Based RLHF via Pipeline Overlap}},
  author    = {Yan, Kaizhuo and Yu, YingJie and Yu, Yifan and Zheng, Haizhong and Lai, Fan},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/yan2026iclr-oppo/}
}