MARSHAL: Incentivizing Multi-Agent Reasoning via Self-Play with Strategic LLMs

Yuan, Huining; Xu, Zelai; Tan, Zheyue; Yi, Xiangmin; Guang, Mo; Long, Kaiwen; Hui, Haojia; Li, Boxun; Chen, Xinlei; Zhao, Bo; Zhang, Xiao-Ping; Yu, Chao; Wang, Yu

ICLR 2026

Abstract

Developing Large Language Models (LLMs) to cooperate and compete effectively within multi-agent systems (MASs) is a critical step towards more advanced intelligence. While reinforcement learning (RL) has proven effective for enhancing reasoning in single-agent tasks, its extension to multi-turn, multi-agent scenarios remains underexplored due to the challenges of long-horizon credit assignment and agent-specific advantage estimation. To address these challenges, we introduce **MARSHAL**, an end-to-end RL framework that incentivizes **M**ulti-**A**gent **R**easoning through **S**elf-play wit**H** str**A**tegic **L**LMs in both cooperative and competitive games. MARSHAL features a turn-level advantage estimator that aligns learning signals with each interaction for credit assignment, and an agent-specific advantage normalization to stabilize multi-agent training. By learning with self-play across cooperative and competitive games, MARSHAL agents trained from Qwen3-4B develop strong strategic abilities, with up to 28.7% performance improvements in held-out games. More importantly, the capability acquired through self-play generalizes beyond games, yielding consistent performance gains of MASs in reasoning benchmarks. When integrated into leading MASs, our MARSHAL agent achieves significant zero-shot performance gains of up to 10.0% on AIME, 7.6% on GPQA-Diamond, and 3.5% on average across all benchmarks. These results establish self-play in strategic games as a powerful approach for developing generalizable multi-agent reasoning capabilities in LLMs.
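The abstract names an agent-specific advantage normalization but does not give its formula. The sketch below shows one plausible reading, not the authors' implementation: per-turn advantages are standardized separately within each agent's own samples, so that a role with systematically larger or smaller returns does not dominate the update. The function name `normalize_per_agent`, the `(agent_id, advantage)` sample layout, and the use of a population standard deviation with a small `eps` are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_per_agent(samples, eps=1e-8):
    """Standardize advantages within each agent's group (illustrative only).

    samples: list of (agent_id, raw_advantage) pairs, one per turn.
    Returns a list of normalized advantages in the same order as the input.
    """
    # Collect the raw advantages belonging to each agent.
    groups = defaultdict(list)
    for agent_id, adv in samples:
        groups[agent_id].append(adv)

    # Per-agent mean and population standard deviation.
    stats = {a: (mean(v), pstdev(v)) for a, v in groups.items()}

    # Subtract the agent's own mean and divide by its own spread.
    return [(adv - stats[a][0]) / (stats[a][1] + eps) for a, adv in samples]


if __name__ == "__main__":
    # Hypothetical turn-level advantages for two agents on very different scales.
    turns = [("p0", 1.0), ("p0", 3.0), ("p1", -10.0), ("p1", -20.0)]
    print(normalize_per_agent(turns))
```

With a single shared normalization, the second agent's large negative values in this toy input would set the scale for both agents; normalizing per agent puts each agent's turns on a comparable scale.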

Cite

Text

Yuan et al. "MARSHAL: Incentivizing Multi-Agent Reasoning via Self-Play with Strategic LLMs." International Conference on Learning Representations, 2026.

Markdown

[Yuan et al. "MARSHAL: Incentivizing Multi-Agent Reasoning via Self-Play with Strategic LLMs." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/yuan2026iclr-marshal/)

BibTeX

@inproceedings{yuan2026iclr-marshal,
  title     = {{MARSHAL: Incentivizing Multi-Agent Reasoning via Self-Play with Strategic LLMs}},
  author    = {Yuan, Huining and Xu, Zelai and Tan, Zheyue and Yi, Xiangmin and Guang, Mo and Long, Kaiwen and Hui, Haojia and Li, Boxun and Chen, Xinlei and Zhao, Bo and Zhang, Xiao-Ping and Yu, Chao and Wang, Yu},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/yuan2026iclr-marshal/}
}