EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget

Chen, Liang; Han, Xueting; Wang, Qizhou; Han, Bo; Bai, Jing; Schuetze, Hinrich; Wong, Kam-Fai

ICLR 2026

Abstract

Balancing exploration and exploitation remains a central challenge in reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs). Current RLVR methods often overemphasize exploitation, leading to entropy collapse, diminished exploratory capacity, and ultimately limited performance gains. Although techniques that increase policy stochasticity can promote exploration, they frequently fail to escape dominant behavioral modes. This creates a self-reinforcing loop (repeatedly sampling and rewarding dominant modes) that further erodes exploration. We introduce **E**xploration-**E**nhanced **P**olicy **O**ptimization (**EEPO**), a framework that promotes exploration via two-stage rollouts with adaptive unlearning. In the first stage, the model generates half of the trajectories; it then undergoes a lightweight unlearning step to temporarily suppress these sampled responses, forcing the second stage to explore different regions of the output space. This *sample-then-forget* mechanism disrupts the self-reinforcing loop and promotes wider exploration during rollouts. Across five reasoning benchmarks, EEPO outperforms GRPO, achieving average relative gains of 24.3% on Qwen2.5-3B, 33.0% on Llama3.2-3B-Instruct, and 10.4% on Qwen3-8B-Base.
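
The two-stage rollout described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the policy here is a plain categorical distribution over a handful of candidate responses, and the "unlearning" step is a crude stand-in (a fixed logit penalty on each response sampled in stage one) rather than the adaptive unlearning objective, whose details are not given on this page. All names (`two_stage_rollout`, `unlearn`, `penalty`) are hypothetical.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample(logits, n, rng):
    """Draw n response indices from the categorical policy."""
    return rng.choices(range(len(logits)), weights=softmax(logits), k=n)


def unlearn(logits, sampled, penalty=3.0):
    """Return a temporary copy of the logits with sampled responses suppressed.

    Assumption: a fixed penalty per distinct sampled response stands in for
    the paper's lightweight unlearning step.
    """
    temp = list(logits)
    for i in set(sampled):
        temp[i] -= penalty
    return temp


def two_stage_rollout(logits, group_size, rng, penalty=3.0):
    """Sample half the group, suppress those samples, then sample the rest."""
    half = group_size // 2
    first = sample(logits, half, rng)            # stage 1: current policy
    temp = unlearn(logits, first, penalty)       # temporary "forgetting"
    second = sample(temp, group_size - half, rng)  # stage 2: suppressed policy
    # The original logits are left untouched, so the suppression is temporary.
    return first, second, temp


if __name__ == "__main__":
    rng = random.Random(0)
    # One dominant mode (index 0) and five rare alternatives.
    logits = [4.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    first, second, temp = two_stage_rollout(logits, group_size=8, rng=rng)
    print("stage 1 samples:", first)
    print("stage 2 samples:", second)
    print("P(mode 0) before:", round(softmax(logits)[0], 3))
    print("P(mode 0) during stage 2:", round(softmax(temp)[0], 3))
```

In this toy setting, any response drawn in the first stage has strictly lower probability in the second stage, so the second half of the group is pushed toward responses the dominant mode would otherwise crowd out.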

Cite

Text

Chen et al. "EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget." International Conference on Learning Representations, 2026.

Markdown

[Chen et al. "EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/chen2026iclr-eepo/)

BibTeX

@inproceedings{chen2026iclr-eepo,
  title     = {{EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget}},
  author    = {Chen, Liang and Han, Xueting and Wang, Qizhou and Han, Bo and Bai, Jing and Schuetze, Hinrich and Wong, Kam-Fai},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/chen2026iclr-eepo/}
}