MALT: Improving Reasoning with Multi-Agent LLM Training

Abstract

Large Language Models (LLMs) often produce an answer with a single chain-of-thought, which restricts their ability to explore alternative reasoning paths or self-correct flawed outputs on complex tasks. In this paper, we introduce MALT (Multi-Agent LLM Training), a novel post-training strategy that divides the reasoning process into generation, verification, and refinement steps handled by a sequential pipeline of heterogeneous agents. During data generation, each agent is repeatedly sampled to form a multi-agent search tree, and the final outputs are graded against ground-truth data. We then apply value iteration to propagate reward signals back to each role-conditioned model, automatically producing multi-agent post-training data without human or teacher-model supervision. Our off-policy approach allows each agent to specialize by learning from both correct and incorrect trajectories, ultimately improving the end-to-end reasoning chain. On MATH, GSM8K, and CSQA, MALT achieves relative improvements of 15.66%, 7.42%, and 9.40% over the same baseline LLM, an important step towards multi-agent cooperative training.
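
A minimal sketch of how the search-tree credit assignment described in the abstract could look in practice: leaf refinements are graded against ground truth, values are backed up to ancestor nodes by averaging (a value-iteration-style backup), and each node is bucketed into positive or negative training data for its role-conditioned model. The Node class, the branching structure, and the 0.5 threshold below are illustrative assumptions, not the authors' implementation.

# Illustrative sketch of MALT-style credit assignment over a
# generator -> verifier -> refiner search tree (not the authors' code).

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    role: str                       # "generator", "verifier", or "refiner"
    output: str                     # sampled model output at this step
    children: List["Node"] = field(default_factory=list)
    reward: Optional[float] = None  # leaves only: 1.0 if the final answer is correct
    value: float = 0.0              # filled in by backup_values

def backup_values(node: Node) -> float:
    """Propagate leaf rewards to ancestors (value-iteration-style backup).

    A leaf's value is its ground-truth reward; an internal node's value is the
    mean of its children's values, i.e. the empirical probability that
    continuing from this node yields a correct final answer.
    """
    if not node.children:
        assert node.reward is not None, "leaf nodes must be graded"
        node.value = node.reward
    else:
        node.value = sum(backup_values(c) for c in node.children) / len(node.children)
    return node.value

def split_training_data(node: Node, threshold: float = 0.5):
    """Bucket every node into positive (e.g. SFT) or negative (e.g. preference)
    training data for its role-conditioned model, based on its backed-up value."""
    positives, negatives = [], []
    stack = [node]
    while stack:
        n = stack.pop()
        (positives if n.value >= threshold else negatives).append((n.role, n.output, n.value))
        stack.extend(n.children)
    return positives, negatives

if __name__ == "__main__":
    # Tiny hand-built tree: one generation, two critiques, two refinements each.
    tree = Node("generator", "draft answer", children=[
        Node("verifier", "critique A", children=[
            Node("refiner", "refined A1", reward=1.0),
            Node("refiner", "refined A2", reward=0.0),
        ]),
        Node("verifier", "critique B", children=[
            Node("refiner", "refined B1", reward=0.0),
            Node("refiner", "refined B2", reward=0.0),
        ]),
    ])
    backup_values(tree)
    positives, negatives = split_training_data(tree)
    print("positive examples:", positives)
    print("negative examples:", negatives)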

Cite

Text

Motwani et al. "MALT: Improving Reasoning with Multi-Agent LLM Training." ICLR 2025 Workshops: LLM_Reason_and_Plan, 2025.

Markdown

[Motwani et al. "MALT: Improving Reasoning with Multi-Agent LLM Training." ICLR 2025 Workshops: LLM_Reason_and_Plan, 2025.](https://mlanthology.org/iclrw/2025/motwani2025iclrw-malt/)

BibTeX

@inproceedings{motwani2025iclrw-malt,
  title     = {{MALT: Improving Reasoning with Multi-Agent LLM Training}},
  author    = {Motwani, Sumeet Ramesh and Smith, Chandler and Das, Rocktim Jyoti and Rafailov, Rafael and Laptev, Ivan and Torr, Philip and Pizzati, Fabio and Clark, Ronald and de Witt, Christian Schroeder},
  booktitle = {ICLR 2025 Workshops: LLM_Reason_and_Plan},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/motwani2025iclrw-malt/}
}