Are We Going MAD? Benchmarking Multi-Agent Debate Between Language Models for Medical Q&A

Abstract

Recent advancements in large language models (LLMs) underscore their potential for responding to medical inquiries. However, ensuring that generative agents provide accurate and reliable answers remains an ongoing challenge. In this context, multi-agent debate (MAD) has emerged as a prominent strategy for enhancing the truthfulness of LLMs. In this work, we provide a comprehensive benchmark of MAD strategies for medical Q&A, along with open-source implementations. This benchmark sheds light on how to use the various strategies effectively, including the trade-offs between cost, time, and accuracy. We build on these insights to propose a novel debate-prompting strategy based on agent agreement that outperforms previously published strategies on medical Q&A tasks.
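For intuition about the agreement-based idea, below is a minimal sketch of a debate loop that stops early once the agents agree. This is an illustrative toy, not the paper's implementation: the `debate` function, the `agent` callables, and the `agreement_threshold` parameter are hypothetical stand-ins for real LLM calls.

```python
from collections import Counter

def debate(agents, question, max_rounds=3, agreement_threshold=1.0):
    """Run a multi-agent debate: each round, agents answer after seeing the
    other agents' previous answers; stop early once the fraction of agents
    sharing the majority answer reaches the threshold."""
    answers = [agent(question, []) for agent in agents]
    for _ in range(max_rounds):
        majority, votes = Counter(answers).most_common(1)[0]
        if votes / len(agents) >= agreement_threshold:
            return majority  # sufficient agreement: end the debate early
        # Otherwise, each agent revises its answer given its peers' answers.
        answers = [
            agent(question, [a for j, a in enumerate(answers) if j != i])
            for i, agent in enumerate(agents)
        ]
    return Counter(answers).most_common(1)[0][0]  # fall back to majority vote

# Toy agents standing in for LLM calls (hypothetical interface):
# an agent maps (question, peer_answers) -> answer string.
if __name__ == "__main__":
    stubborn = lambda q, peers: "B"
    swayed = lambda q, peers: Counter(peers).most_common(1)[0][0] if peers else "A"
    print(debate([stubborn, stubborn, swayed], "Which drug treats X?"))  # -> "B"
```

With `agreement_threshold=1.0` the loop debates until unanimity or the round budget is exhausted; lowering the threshold ends debates sooner, trading potential accuracy for fewer (costly) model calls, in the spirit of the cost/time/accuracy trade-offs the paper benchmarks.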

Cite

Text

Smit et al. "Are We Going MAD? Benchmarking Multi-Agent Debate Between Language Models for Medical Q&A." NeurIPS 2023 Workshops: DGM4H, 2023.

Markdown

[Smit et al. "Are We Going MAD? Benchmarking Multi-Agent Debate Between Language Models for Medical Q&A." NeurIPS 2023 Workshops: DGM4H, 2023.](https://mlanthology.org/neuripsw/2023/smit2023neuripsw-we/)

BibTeX

@inproceedings{smit2023neuripsw-we,
  title     = {{Are We Going MAD? Benchmarking Multi-Agent Debate Between Language Models for Medical Q\&A}},
  author    = {Smit, Andries Petrus and Duckworth, Paul and Grinsztajn, Nathan and Tessera, Kale-ab and Barrett, Thomas D. and Pretorius, Arnu},
  booktitle = {NeurIPS 2023 Workshops: DGM4H},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/smit2023neuripsw-we/}
}