Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia

Abstract

Large Language Model (LLM) agents have demonstrated impressive capabilities for social interaction and are increasingly being deployed in situations where they might engage with both human and artificial agents. These interactions represent a critical frontier for LLM-based agents, yet existing evaluation methods fail to measure how well these capabilities generalize to novel social situations. In this paper, we introduce a method for evaluating the ability of LLM-based agents to cooperate in zero-shot, mixed-motive environments using Concordia, a natural language multi-agent simulation environment. Our method measures general cooperative intelligence by testing an agent's ability to identify and exploit opportunities for mutual gain across diverse partners and contexts. We present empirical results from the NeurIPS 2024 Concordia Contest, where agents were evaluated on their ability to achieve mutual gains across a suite of diverse scenarios ranging from negotiation to collective action problems. Our findings reveal significant gaps between current agent capabilities and the robust generalization required for reliable cooperation, particularly in scenarios demanding persuasion and norm enforcement.

Cite

Text

Smith et al. "Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia." Advances in Neural Information Processing Systems, 2025.

Markdown

[Smith et al. "Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/smith2025neurips-evaluating/)

BibTeX

@inproceedings{smith2025neurips-evaluating,
  title     = {{Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia}},
  author    = {Smith, Chandler and Abdulhai, Marwa and Diaz, Manfred and Tesic, Marko and Trivedi, Rakshit and Vezhnevets, Sasha and Hammond, Lewis and Clifton, Jesse and Chang, Minsuk and Duéñez-Guzmán, Edgar A. and Agapiou, John P. and Matyas, Jayd and Karmon, Danny and Zhang, Beining and Dilkes, Jim and Kundu, Akash and Nguyen, Jord and Tewolde, Emanuel and Purbey, Jebish and Kadiyala, Ram Mohan Rao and Gupta, Siddhant and Korshuk, Aliaksei and Alexander, Buyantuev and Makarov, Ilya and Zhao, Gang and Fernandez, Rolando and Wang, Zhihan and Wang, Caroline and Cui, Jiaxun and Xiao, Lingyun and Shi, Di Yang and Sung, Yoonchang and Rahman, Arrasy and Stone, Peter and Kang, Yipeng and Yun, Hyeonggeun and Ananya, Ananya and Cha, Taehun and Wu, Zhiqiang and Tennant, Elizaveta and Macmillan-Scott, Olivia and Segura, Marta Emili García and Riazi, Diana and Cui, Fuyang and Subramanian, Sriram Ganapathi and Klassen, Toryn Q. and Schiavone, Nico and Alim, Mogtaba and McIlraith, Sheila A. and Beltran, Manuel Sebastian Rios and Peña, Oswaldo and Rojas, Carlos Saith Rodriguez and Chacon-Chamorro, Manuela and Manrique, Ruben and Giraldo, Luis Felipe and Quijano, Nicanor and Wang, Yiding and Chen, Yuxuan and Zhong, Fangwei and Wang, Mengmeng and Tu, Wenming and Zhang, Zhaowei and Chen, Ziang and Jia, Zixia and Feng, Xue and Zheng, Zilong and Lin, Chichen and Fan, Weijian and Liu, Chenao and Sarangi, Sneheel and Wang, Ziyan and Shi, Shuqing and Du, Yali and Kulandaivel, Avinaash Anand and Liu, Yang and Ruiyang, Wu and Talele, Chetan and {陆孙嘉} and Piqueras, Gema Parreño and Dhuri, Shamika and McHale, Bain and Baarslag, Tim and Hadfield-Menell, Dylan and Jaques, Natasha and Hernandez-Orallo, Jose and Leibo, Joel Z.},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/smith2025neurips-evaluating/}
}