Learn the Ropes, Then Trust the Wins: Self-Imitation with Progressive Exploration for Agentic Reinforcement Learning

Qin, Yulei; Tan, Xiaoyu; He, Zhengbao; Li, Gang; Lin, Haojia; Li, Zongyi; Xu, Zihan; Shi, Yuchen; Cai, Siqi; Rui, Renting; Cai, Shaofei; Cai, Yuzheng; Zhang, Xuan; Ye, Sheng; Li, Ke; Sun, Xing

ICLR 2026

Abstract

Reinforcement learning (RL) is the dominant paradigm for sharpening the strategic tool-use capabilities of LLMs on long-horizon, sparsely rewarded agent tasks, yet it faces the fundamental challenge of the exploration-exploitation trade-off. Existing studies stimulate exploration through the lens of policy entropy, but such mechanical entropy maximization is prone to RL instability due to multi-turn distribution shift. In this paper, we target a progressive exploration-exploitation balance guided by the agent's own experiences, without succumbing to either entropy collapse or runaway divergence. We propose SPEAR, a self-imitation learning (SIL) recipe for training agentic LLMs. It extends vanilla SIL, in which a replay buffer stores good experiences for off-policy updates, by gradually steering the policy entropy across stages. Specifically, the proposed curriculum scheduling harmonizes intrinsic reward shaping and self-imitation to 1) expedite exploration via frequent tool interactions at the beginning, and 2) strengthen exploitation of successful tactics once the policy has become familiar with the environment. We also combine a bag of tricks from industrial RL optimizations into a strong baseline, Dr.BoT, to demonstrate the effectiveness of our approach. On ALFWorld and WebShop, SPEAR increases the success rates of GRPO/GiGPO/Dr.BoT by up to 16.1%/5.1%/8.6% and 20.7%/11.8%/13.9%, respectively. On AIME24 and AIME25, SPEAR boosts Dr.BoT by up to 3.8% and 6.1%, respectively. These gains incur only 10%–25% extra theoretical complexity and negligible runtime overhead in practice, demonstrating the plug-and-play scalability of SPEAR.
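The two-stage idea in the abstract (reward tool interactions early, lean on replayed successes later) can be pictured with a small sketch. This is not the authors' implementation: the abstract does not give the schedule shape, the buffer admission rule, or the form of the intrinsic reward, so the cosine ramp, the "reward above baseline" filter, and every name below (`Trajectory`, `ReplayBuffer`, `cosine_ramp`, `schedule`, `shaped_reward`, `bonus_per_call`, `max_bonus`) are assumptions chosen only for illustration.

```python
import math
from collections import deque
from dataclasses import dataclass


@dataclass
class Trajectory:
    tokens: list      # placeholder for the rollout content
    reward: float     # outcome reward returned by the environment
    tool_calls: int   # number of tool interactions in the rollout


class ReplayBuffer:
    """Stores 'good' rollouts for later off-policy (self-imitation) updates.

    Assumption: a rollout counts as good when its reward exceeds a baseline.
    """

    def __init__(self, capacity: int):
        # deque with maxlen silently evicts the oldest entry when full.
        self.items = deque(maxlen=capacity)

    def add(self, traj: Trajectory, baseline: float) -> bool:
        if traj.reward > baseline:
            self.items.append(traj)
            return True
        return False

    def __len__(self) -> int:
        return len(self.items)


def cosine_ramp(step: int, total_steps: int) -> float:
    """Rises smoothly from 0 at step 0 to 1 at total_steps, then stays at 1."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * (1.0 - math.cos(math.pi * frac))


def schedule(step: int, total_steps: int) -> tuple:
    """Returns (intrinsic_weight, sil_weight) for the current training step.

    Assumption: the two weights are complementary, so exploration incentives
    fade exactly as self-imitation is phased in.
    """
    sil_weight = cosine_ramp(step, total_steps)
    return 1.0 - sil_weight, sil_weight


def shaped_reward(traj: Trajectory, intrinsic_weight: float,
                  bonus_per_call: float = 0.1, max_bonus: float = 0.5) -> float:
    """Outcome reward plus a capped, schedule-weighted bonus for tool use."""
    bonus = min(traj.tool_calls * bonus_per_call, max_bonus)
    return traj.reward + intrinsic_weight * bonus
```

Under these assumptions, `schedule(0, 100)` puts all weight on the tool-use bonus and none on self-imitation, and the balance is reversed by step 100. Capping the bonus is one simple way to keep the intrinsic term from dominating the outcome reward; a real training loop would also have to correct for the off-policy nature of replayed samples, which this sketch leaves out.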

Cite

Text

Qin et al. "Learn the Ropes, Then Trust the Wins: Self-Imitation with Progressive Exploration for Agentic Reinforcement Learning." International Conference on Learning Representations, 2026.

Markdown

[Qin et al. "Learn the Ropes, Then Trust the Wins: Self-Imitation with Progressive Exploration for Agentic Reinforcement Learning." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/qin2026iclr-learn/)

BibTeX

@inproceedings{qin2026iclr-learn,
  title     = {{Learn the Ropes, Then Trust the Wins: Self-Imitation with Progressive Exploration for Agentic Reinforcement Learning}},
  author    = {Qin, Yulei and Tan, Xiaoyu and He, Zhengbao and Li, Gang and Lin, Haojia and Li, Zongyi and Xu, Zihan and Shi, Yuchen and Cai, Siqi and Rui, Renting and Cai, Shaofei and Cai, Yuzheng and Zhang, Xuan and Ye, Sheng and Li, Ke and Sun, Xing},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/qin2026iclr-learn/}
}