Agentic Reinforced Policy Optimization

Dong, Guanting; Mao, Hangyu; Ma, Kai; Bao, Licheng; Chen, Yifei; Wang, Zhongyuan; Chen, Zhongxia; Du, Jiazhen; Wang, Huiyang; Zhang, Fuzheng; Zhou, Guorui; Zhu, Yutao; Wen, Ji-Rong; Dou, Zhicheng

ICLR 2026

Abstract

Large-scale reinforcement learning with verifiable rewards (RLVR) has proven effective in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. In realistic reasoning scenarios, LLMs often rely on external tools to assist in task-solving processes. However, current RL algorithms typically employ trajectory-level rollout sampling, consistently neglecting the fine-grained exploration of multi-turn tool-call steps. To bridge this gap, we propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents. Our preliminary experiments reveal that LLMs frequently exhibit increased uncertainty after tool-call steps, as evidenced by higher entropy in the distribution of generated tokens. Motivated by this, ARPO incorporates an entropy-based adaptive rollout mechanism, encouraging the policy model to adaptively branch sampling during high-entropy tool-call rounds, thereby promoting step-level exploration of latent tool-use behaviors. By integrating advantage attribution estimation, ARPO enables LLMs to internalize advantage differences in stepwise tool-use interactions. Experiments across 13 challenging benchmarks demonstrate ARPO's superiority over trajectory-level RL algorithms. Remarkably, ARPO achieves improved performance using only half of the tool-use budget required by existing methods, offering a scalable solution for aligning LLM-based agents with real-time dynamic environments. Our code is released at https://github.com/RUC-NLPIR/ARPO.
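To make the entropy signal in the abstract concrete, the following is a minimal sketch of how a rollout loop might decide whether to branch after a tool call. It is not the authors' implementation: the function names (token_entropy, mean_entropy, should_branch), the comparison against a pre-tool baseline, and the threshold value are all assumptions chosen for illustration. The only part taken from the abstract is the idea that higher token-distribution entropy after a tool call triggers additional branch sampling.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(step_probs: Sequence[Sequence[float]]) -> float:
    """Average entropy over several consecutive next-token distributions."""
    if not step_probs:
        raise ValueError("need at least one token distribution")
    return sum(token_entropy(p) for p in step_probs) / len(step_probs)


def should_branch(
    baseline_probs: Sequence[Sequence[float]],
    post_tool_probs: Sequence[Sequence[float]],
    threshold: float = 0.2,  # hypothetical value, not taken from the paper
) -> bool:
    """Branch when entropy after the tool call exceeds the baseline by `threshold`.

    Assumption: the baseline is measured on tokens generated before the
    tool call, and the comparison is a plain difference of mean entropies.
    """
    delta = mean_entropy(post_tool_probs) - mean_entropy(baseline_probs)
    return delta > threshold


# Illustrative distributions over a toy 4-token vocabulary.
confident = [[0.97, 0.01, 0.01, 0.01]]  # low entropy before the tool call
uncertain = [[0.25, 0.25, 0.25, 0.25]]  # high entropy after the tool call
branch_after_tool = should_branch(confident, uncertain)
```

In this toy case the post-tool distribution is uniform, so the entropy gap is large and the sketch chooses to branch. A real system would read the probabilities from the policy model's logits and would need its own rule for how many branches to spawn within the sampling budget.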

Cite

Text

Dong et al. "Agentic Reinforced Policy Optimization." International Conference on Learning Representations, 2026.

Markdown

[Dong et al. "Agentic Reinforced Policy Optimization." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/dong2026iclr-agentic/)

BibTeX

@inproceedings{dong2026iclr-agentic,
  title     = {{Agentic Reinforced Policy Optimization}},
  author    = {Dong, Guanting and Mao, Hangyu and Ma, Kai and Bao, Licheng and Chen, Yifei and Wang, Zhongyuan and Chen, Zhongxia and Du, Jiazhen and Wang, Huiyang and Zhang, Fuzheng and Zhou, Guorui and Zhu, Yutao and Wen, Ji-Rong and Dou, Zhicheng},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/dong2026iclr-agentic/}
}