REST: Reshaping Token-Level Policy Gradients for Tool-Use Large Language Models

Lin, Zihan; Wang, Xiaohan; Cao, Jie; Chai, Jiajun; Yin, Guojun; Lin, Wei; He, Ran

ICLR 2026

Abstract

Large language models (LLMs) transcend passive generation and act as goal-directed agents by invoking external tools. Reinforcement learning (RL) offers a principled framework for optimizing these emergent tool-use policies, yet the prevailing paradigm relies exclusively on sparse outcome rewards and ignores the particular structure of tool-use tasks, which inflates policy-gradient variance and makes training inefficient. To better understand and address these challenges, we first establish a theoretical link between policy entropy and the training stability of tool-use tasks, which reveals that structured, low-entropy tokens are the primary determinants of rewards. Motivated by this insight, we propose Reshaped Token-level policy gradients (ResT) for tool-use tasks. ResT reshapes the policy gradient through entropy-informed token reweighting, progressively upweighting reasoning tokens as training proceeds. This scheme enables a smooth shift from structural correctness to semantic reasoning and stabilizes convergence in multi-turn tool-use tasks. Evaluation on BFCL and API-Bank shows that ResT outperforms strong baselines by up to 8.76%. When fine-tuned on a 4B base LLM, ResT further surpasses GPT-4o by 4.11% on single-turn tasks and 1.50% on multi-turn base tasks. Code is available at https://github.com/1229095296/ResT_Tool_use_LLM.
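
The entropy-informed reweighting described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the entropy threshold, the linear schedule, the `alpha` knob, and all function names below are hypothetical choices made only to show the general idea of weighting low-entropy (structural) tokens more heavily early in training and high-entropy (reasoning) tokens more heavily later.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def reshaped_weights(entropies, progress, threshold=0.5, alpha=1.0):
    """Per-token weights whose emphasis shifts as training proceeds.

    progress is the fraction of training completed, in [0, 1].
    threshold and alpha are hypothetical knobs, not values from the paper.
    """
    raw = []
    for h in entropies:
        if h < threshold:
            # Low-entropy token: emphasized early, de-emphasized later.
            raw.append(1.0 + alpha * (1.0 - progress))
        else:
            # High-entropy token: progressively upweighted.
            raw.append(1.0 + alpha * progress)
    mean = sum(raw) / len(raw)
    # Normalize so the average weight stays at 1.
    return [w / mean for w in raw]


def reweighted_pg_loss(logprobs, advantage, weights):
    """Token-weighted policy-gradient surrogate loss for one sequence."""
    total = sum(w * advantage * lp for w, lp in zip(weights, logprobs))
    return -total / len(logprobs)
```

With two tokens of entropy 0.1 and 1.2, `reshaped_weights` at `progress=0.0` favors the first token, and at `progress=1.0` it favors the second. The actual weighting function and schedule used by ResT should be taken from the paper and the linked repository.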

Cite

Text

Lin et al. "REST: Reshaping Token-Level Policy Gradients for Tool-Use Large Language Models." International Conference on Learning Representations, 2026.

Markdown

[Lin et al. "REST: Reshaping Token-Level Policy Gradients for Tool-Use Large Language Models." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/lin2026iclr-rest/)

BibTeX

@inproceedings{lin2026iclr-rest,
  title     = {{REST: Reshaping Token-Level Policy Gradients for Tool-Use Large Language Models}},
  author    = {Lin, Zihan and Wang, Xiaohan and Cao, Jie and Chai, Jiajun and Yin, Guojun and Lin, Wei and He, Ran},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/lin2026iclr-rest/}
}