Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory

Abstract

We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update episodic and semantic memories, gradually accumulating world knowledge. Its memory is organized in an entity-centric, multimodal manner, enabling deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn reasoning and retrieves relevant memories to complete tasks. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a long-video question answering benchmark comprising 100 newly recorded robot-perspective videos (M3-Bench-robot) and 920 diverse web-sourced videos (M3-Bench-web). We annotate QA pairs designed to test capabilities essential for agent applications, such as person understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web and VideoMME-long, respectively. Our work advances multimodal agents toward more human-like long-term memory and provides insights for their practical design. Models, datasets and code are available at https://github.com/ByteDance-Seed/m3-agent.
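As a rough illustration of the entity-centric memory and multi-turn retrieval described in the abstract, the following is a minimal sketch and not the authors' implementation. All names (`MemoryEntry`, `EntityMemory`, `multi_turn_retrieve`) are hypothetical. Word overlap stands in for whatever retrieval method the actual system uses, and a fixed list of queries stands in for the queries that the agent would generate during reasoning.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class MemoryEntry:
    kind: str      # "episodic" (what happened) or "semantic" (general knowledge)
    modality: str  # e.g. "visual", "audio", "text"
    content: str   # textual description of the memory
    clip_id: int   # index of the video clip the memory came from


@dataclass
class EntityMemory:
    # Memories are grouped under the entity they describe (assumed layout).
    entries: Dict[str, List[MemoryEntry]] = field(default_factory=dict)

    def add(self, entity_id: str, entry: MemoryEntry) -> None:
        self.entries.setdefault(entity_id, []).append(entry)

    def search(self, query: str, top_k: int = 2) -> List[Tuple[str, MemoryEntry]]:
        # Toy relevance score: number of shared lowercase words.
        query_words = set(query.lower().split())
        scored = []
        for entity_id, items in self.entries.items():
            for entry in items:
                overlap = len(query_words & set(entry.content.lower().split()))
                if overlap > 0:
                    scored.append((overlap, entity_id, entry))
        # Stable sort: ties keep insertion order.
        scored.sort(key=lambda item: item[0], reverse=True)
        return [(entity_id, entry) for _, entity_id, entry in scored[:top_k]]


def multi_turn_retrieve(
    memory: EntityMemory, queries: List[str], max_turns: int = 3
) -> List[Tuple[str, str]]:
    # One retrieval per turn; evidence accumulates without duplicates.
    evidence: List[Tuple[str, str]] = []
    for query in queries[:max_turns]:
        for entity_id, entry in memory.search(query):
            item = (entity_id, entry.content)
            if item not in evidence:
                evidence.append(item)
    return evidence


memory = EntityMemory()
memory.add("person_1", MemoryEntry("episodic", "visual", "person_1 pours coffee in the kitchen", 3))
memory.add("person_1", MemoryEntry("semantic", "text", "person_1 prefers coffee in the morning", 3))
memory.add("person_2", MemoryEntry("episodic", "audio", "person_2 says she is allergic to nuts", 7))

evidence = multi_turn_retrieve(memory, ["who drinks coffee", "allergic to nuts"])
for entity_id, content in evidence:
    print(entity_id, "|", content)
```

Keying memories by entity keeps facts about the same person together across clips and modalities, which is one plausible reading of the "entity-centric" organization mentioned above.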

Cite

Text

Long et al. "Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory." International Conference on Learning Representations, 2026.

Markdown

[Long et al. "Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/long2026iclr-seeing/)

BibTeX

@inproceedings{long2026iclr-seeing,
  title     = {{Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory}},
  author    = {Long, Lin and He, Yichen and Ye, Wentao and Pan, Yiyuan and Lin, Yuan and Li, Hang and Zhao, Junbo and Li, Wei},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/long2026iclr-seeing/}
}