ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL
Abstract
Large language models (LLMs) have the potential to tackle sequential decision-making problems thanks to their generalist capabilities. In such problems, instead of optimizing "myopic" surrogate objectives such as human preferences within a single turn, we wish to directly optimize long-term objectives, such as user satisfaction over an entire dialogue with an LLM or delayed success metrics in web navigation. Multi-turn reinforcement learning (RL) provides an appealing approach to directly optimize long-term objectives, but how can we design effective and efficient multi-turn RL algorithms for LLMs? In this work, we propose an algorithmic framework for multi-turn RL with LLMs that preserves the flexibility of token-by-token RL used in single-turn problems, while accommodating long horizons and delayed rewards more effectively. Our framework, the Actor-Critic Framework with a Hierarchical Structure (ArCHer), combines a high-level off-policy RL algorithm that trains a value function with a low-level RL algorithm that trains a token-by-token policy. While ArCHer can be instantiated with multiple RL algorithms, a particularly convenient instantiation uses temporal difference (TD) learning at the high level and on-policy token-level policy gradient at the low level. Empirically, we show that ArCHer significantly improves the efficiency and performance of multi-turn LLM agents, attaining a sample-efficiency boost of about 100x over prior on-policy methods and converging to much better performance than other off-policy methods.
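To make the hierarchical structure described in the abstract concrete, here is a minimal PyTorch-style sketch of one ArCHer-style update: a high-level critic trained with TD learning on utterance-level transitions, and a low-level token-level policy gradient that broadcasts the critic's advantage to every token of the sampled utterance. All module interfaces and tensor names (critic, actor.log_probs, the batch keys) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def archer_update(critic, target_critic, actor, batch, gamma=0.99):
    """One hierarchical update on a batch of utterance-level transitions.

    Assumed batch keys (hypothetical): obs / next_obs (encoded dialogue
    states), utterance (token ids of the agent's action), reward, done.
    """
    # --- High level: off-policy TD learning on the utterance-level MDP ---
    v = critic(batch["obs"]).squeeze(-1)                       # V(s_t)
    with torch.no_grad():
        v_next = target_critic(batch["next_obs"]).squeeze(-1)  # V(s_{t+1})
        td_target = batch["reward"] + gamma * (1 - batch["done"]) * v_next
    critic_loss = F.mse_loss(v, td_target)

    # --- Low level: on-policy token-level policy gradient ---
    # The utterance-level advantage estimate from the high-level critic is
    # applied to every token's log-probability (a REINFORCE-style surrogate).
    with torch.no_grad():
        advantage = td_target - v
    token_log_probs = actor.log_probs(batch["obs"], batch["utterance"])
    actor_loss = -(advantage.unsqueeze(-1) * token_log_probs).mean()
    return critic_loss, actor_loss
```

In this sketch the two levels are decoupled: the critic can be trained off-policy from replayed utterance-level transitions, while the token-level actor update only needs the critic's scalar advantage per utterance, which is what allows the framework to keep short token-level horizons despite long multi-turn episodes.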
Cite
Text
Zhou et al. "ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL." ICLR 2024 Workshops: LLMAgents, 2024.

Markdown
[Zhou et al. "ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL." ICLR 2024 Workshops: LLMAgents, 2024.](https://mlanthology.org/iclrw/2024/zhou2024iclrw-archer/)

BibTeX
@inproceedings{zhou2024iclrw-archer,
title = {{ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL}},
author = {Zhou, Yifei and Zanette, Andrea and Pan, Jiayi and Kumar, Aviral and Levine, Sergey},
booktitle = {ICLR 2024 Workshops: LLMAgents},
year = {2024},
url = {https://mlanthology.org/iclrw/2024/zhou2024iclrw-archer/}
}