MuMA-ToM: Multi-Modal Multi-Agent Theory of Mind
Abstract
Understanding people’s social interactions in complex real-world scenarios often relies on intricate mental reasoning. To truly understand how and why people interact with one another, we must infer the underlying mental states that give rise to the social interactions, i.e., Theory of Mind reasoning in multi-agent interactions. Additionally, social interactions are often multi-modal: we can watch people’s actions, hear their conversations, and/or read about their past behaviors. For AI systems to successfully and safely interact with people in real-world environments, they also need to understand people’s mental states as well as their inferences about each other’s mental states based on multi-modal information about their interactions. For this, we introduce MuMA-ToM, a Multi-modal Multi-Agent Theory of Mind benchmark. MuMA-ToM is the first multi-modal Theory of Mind benchmark that evaluates mental reasoning in embodied multi-agent interactions. In MuMA-ToM, we provide video and text descriptions of people’s multi-modal behavior in realistic household environments. Based on the context, we then ask questions about people’s goals, beliefs, and beliefs about others’ goals. We validate MuMA-ToM in a human experiment and provide a human baseline. We also propose a novel multi-modal, multi-agent ToM model, LIMP (Language model-based Inverse Multi-agent Planning). Our experimental results show that LIMP significantly outperforms state-of-the-art methods, including large multi-modal models (e.g., GPT-4o, Gemini 1.5 Pro) and a recent multi-modal ToM model, BIP-ALM.
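For readers unfamiliar with inverse planning, the core inference that models like LIMP build on can be sketched compactly. The following is a generic Bayesian inverse-planning posterior, not necessarily LIMP’s exact formulation; the symbols (agent i’s goal g_i, belief b_i, observed actions a^i_{1:T}, and environment states s_{1:T}) are illustrative:

% Generic Bayesian inverse planning (illustrative sketch, not LIMP's exact model):
% the observer scores each hypothesis (g_i, b_i) by how well a rational planner
% holding that goal and belief explains agent i's observed actions.
\[
P(g_i, b_i \mid a^i_{1:T}, s_{1:T})
\;\propto\;
P(g_i)\, P(b_i) \prod_{t=1}^{T} P\!\left(a^i_t \mid s_t, g_i, b_i\right)
\]

Higher-order questions of the kind MuMA-ToM asks (e.g., agent i’s belief about agent j’s goal) would nest this same inference inside the hypothesis space for b_i.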
Cite
Text
Shi et al. "MuMA-ToM: Multi-Modal Multi-Agent Theory of Mind." NeurIPS 2024 Workshops: Video-Language_Models, 2024.
Markdown
[Shi et al. "MuMA-ToM: Multi-Modal Multi-Agent Theory of Mind." NeurIPS 2024 Workshops: Video-Language_Models, 2024.](https://mlanthology.org/neuripsw/2024/shi2024neuripsw-mumatom-a/)
BibTeX
@inproceedings{shi2024neuripsw-mumatom-a,
  title     = {{MuMA-ToM: Multi-Modal Multi-Agent Theory of Mind}},
  author    = {Shi, Haojun and Ye, Suyu and Fang, Xinyu and Jin, Chuanyang and Isik, Leyla and Kuo, Yen-Ling and Shu, Tianmin},
  booktitle = {NeurIPS 2024 Workshops: Video-Language_Models},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/shi2024neuripsw-mumatom-a/}
}