VideoAgent: A Memory-Augmented Multimodal Agent for Video Understanding

Abstract

We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism could tackle the challenging video understanding problem, especially capturing the long-term temporal relations in lengthy videos. In particular, the proposed multimodal agent VideoAgent: 1) constructs a structured memory to store both the generic temporal event descriptions and object-centric tracking states of the video; 2) given an input task query, employs tools including video segment localization and object memory querying, along with other visual foundation models, to interactively solve the task, utilizing the zero-shot tool-use ability of LLMs. VideoAgent demonstrates impressive performance on several long-horizon video understanding benchmarks, with an average increase of 6.6% on NExT-QA and 26.0% on EgoSchema over baselines, closing the gap between open-sourced models and private counterparts including Gemini 1.5 Pro. The code and demo can be found at https://videoagent.github.io.
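The abstract describes a unified memory with two components (temporal event descriptions and object-centric tracking states) queried by tools under an LLM planner. The sketch below is only an illustration of that idea, not the paper's implementation; the class and function names (`StructuredMemory`, `localize_segments`, `query_object`) are hypothetical, and the tool calls are toy stand-ins for the actual video segment localization and object memory querying modules.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified structures mirroring the two memory components
# named in the abstract: temporal event descriptions and object tracking states.

@dataclass
class EventEntry:
    start_frame: int
    end_frame: int
    caption: str  # generic description of what happens in this segment

@dataclass
class ObjectState:
    object_id: int
    category: str
    frames: list = field(default_factory=list)  # frame indices where the object is tracked

@dataclass
class StructuredMemory:
    events: list = field(default_factory=list)   # temporal event descriptions
    objects: dict = field(default_factory=dict)  # object_id -> ObjectState

    def localize_segments(self, keyword: str):
        """Toy stand-in for the video-segment-localization tool:
        return events whose captions mention the query keyword."""
        return [e for e in self.events if keyword.lower() in e.caption.lower()]

    def query_object(self, category: str):
        """Toy stand-in for the object-memory-querying tool:
        return tracked objects of the requested category."""
        return [o for o in self.objects.values() if o.category == category]


if __name__ == "__main__":
    memory = StructuredMemory()
    memory.events.append(EventEntry(0, 120, "a person opens the fridge"))
    memory.events.append(EventEntry(121, 300, "the person pours milk into a cup"))
    memory.objects[1] = ObjectState(1, "cup", frames=[130, 150, 200])

    # In the described agent, an LLM would pick which tool to invoke for a query
    # such as "When does the person use the cup?"; here we call the tools directly.
    print(memory.localize_segments("cup"))
    print(memory.query_object("cup"))
```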

Cite

Text

Fan et al. "VideoAgent: A Memory-Augmented Multimodal Agent for Video Understanding." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72670-5_5

Markdown

[Fan et al. "VideoAgent: A Memory-Augmented Multimodal Agent for Video Understanding." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/fan2024eccv-videoagent/) doi:10.1007/978-3-031-72670-5_5

BibTeX

@inproceedings{fan2024eccv-videoagent,
  title     = {{VideoAgent: A Memory-Augmented Multimodal Agent for Video Understanding}},
  author    = {Fan, Yue and Ma, Xiaojian and Wu, Rujie and Du, Yuntao and Li, Jiaqi and Gao, Zhi and Li, Qing},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72670-5_5},
  url       = {https://mlanthology.org/eccv/2024/fan2024eccv-videoagent/}
}