MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Abstract

With the success of large language models (LLMs), integrating the vision model into LLMs to build vision-language foundation models has gained much more interest recently. However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames for short video understanding. In this study, we mainly focus on designing an efficient and effective model for long-term video understanding. Instead of trying to process more frames simultaneously like most existing work, we propose to process videos in an online manner and store past video information in a memory bank. This allows our model to reference historical video content for long-term analysis without exceeding LLMs' context length constraints or GPU memory limits. Our memory bank can be seamlessly integrated into current multimodal LLMs in an off-the-shelf manner. We conduct extensive experiments on various video understanding tasks, such as long-video understanding, video question answering, and video captioning, and our model can achieve state-of-the-art performances across multiple datasets.
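
The online processing idea in the abstract can be illustrated with a small sketch: frames arrive one at a time, and each frame's feature vector is appended to a memory bank of bounded size, so the stored history never grows with video length. The abstract does not say how the bank stays within its budget, so the merging policy below (averaging the most similar pair of temporally adjacent entries) is an assumption made for illustration. The names `MemoryBank`, `capacity`, `cosine`, and `process_video` are hypothetical, and plain Python lists stand in for real visual features.

```python
import math
from typing import Iterable, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


class MemoryBank:
    """Hypothetical fixed-capacity store of per-frame feature vectors."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.items: List[Vector] = []

    def add(self, feature: Vector) -> None:
        """Append one frame feature, then shrink back to capacity if needed."""
        self.items.append(list(feature))
        if len(self.items) > self.capacity:
            self._compress()

    def _compress(self) -> None:
        # Assumed policy (not stated in the abstract): find the most similar
        # pair of temporally adjacent entries and replace it with its average.
        sims = [
            cosine(self.items[i], self.items[i + 1])
            for i in range(len(self.items) - 1)
        ]
        i = max(range(len(sims)), key=sims.__getitem__)
        merged = [(x + y) / 2 for x, y in zip(self.items[i], self.items[i + 1])]
        self.items[i : i + 2] = [merged]


def process_video(frames: Iterable[Vector], capacity: int) -> MemoryBank:
    """Consume frame features one at a time, keeping a bounded history."""
    bank = MemoryBank(capacity)
    for feature in frames:
        bank.add(feature)
    return bank
```

Under this assumed policy, memory use depends only on `capacity` and not on the number of frames, which is the property the abstract relies on to stay within context-length and GPU memory limits. In a real system the bank's contents would be passed to the language model as context; that step is outside the scope of this sketch.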

Cite

Text

He et al. "MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01282

Markdown

[He et al. "MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/he2024cvpr-malmm/) doi:10.1109/CVPR52733.2024.01282

BibTeX

@inproceedings{he2024cvpr-malmm,
  title     = {{MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding}},
  author    = {He, Bo and Li, Hengduo and Jang, Young Kyun and Jia, Menglin and Cao, Xuefei and Shah, Ashish and Shrivastava, Abhinav and Lim, Ser-Nam},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {13504-13514},
  doi       = {10.1109/CVPR52733.2024.01282},
  url       = {https://mlanthology.org/cvpr/2024/he2024cvpr-malmm/}
}