MM-Narrator: Narrating Long-Form Videos with Multimodal In-Context Learning

Abstract

We present MM-Narrator, a novel system leveraging GPT-4 with multimodal in-context learning for the generation of audio descriptions (AD). Unlike previous methods that primarily focused on downstream fine-tuning with short video clips, MM-Narrator excels in generating precise audio descriptions for videos of extensive lengths, even beyond hours, in an autoregressive manner. This capability is made possible by the proposed memory-augmented generation process, which effectively utilizes both the short-term textual context and long-term visual memory through an efficient register-and-recall mechanism. These contextual memories compile pertinent past information, including storylines and character identities, ensuring accurate tracking and depiction of story-coherent and character-centric audio descriptions. Maintaining the training-free design of MM-Narrator, we further propose a complexity-based demonstration selection strategy to largely enhance its multi-step reasoning capability via few-shot multimodal in-context learning (MM-ICL). Experimental results on the MAD-eval dataset demonstrate that MM-Narrator consistently outperforms both existing fine-tuning-based approaches and LLM-based approaches in most scenarios, as measured by standard evaluation metrics. Additionally, we introduce the first segment-based evaluator for recurrent text generation. Empowered by GPT-4, this evaluator comprehensively reasons and marks AD generation performance in various extendable dimensions.
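
To give a concrete picture of the register-and-recall mechanism summarized above, the following is a minimal sketch of how a memory-augmented, autoregressive AD loop could be organized. It is an assumption-laden illustration rather than the authors' implementation: the class names (`ShortTermTextMemory`, `LongTermVisualMemory`), the helper functions (`llm_complete`, `narrate_clip`), the prompt layout, and the keyword-overlap retrieval are all placeholders; the paper's actual prompts, memory formats, and retrieval scheme may differ.

```python
# Hypothetical sketch of a register-and-recall, memory-augmented AD loop.
# Not the authors' code; names and retrieval logic are illustrative assumptions.
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ShortTermTextMemory:
    """Rolling window of the most recently generated ADs (short-term textual context)."""
    max_items: int = 8
    items: deque = field(default_factory=deque)

    def register(self, text: str) -> None:
        self.items.append(text)
        while len(self.items) > self.max_items:
            self.items.popleft()

    def recall(self) -> str:
        return "\n".join(self.items)


@dataclass
class LongTermVisualMemory:
    """Accumulates visual observations (captions, character identities) over the whole video."""
    entries: list = field(default_factory=list)

    def register(self, timestamp: float, observation: str) -> None:
        self.entries.append((timestamp, observation))

    def recall(self, query: str, top_k: int = 3) -> list:
        # Placeholder relevance scoring via naive keyword overlap;
        # a real system would likely use embedding-based retrieval.
        def score(entry):
            _, obs = entry
            return len(set(query.lower().split()) & set(obs.lower().split()))
        return [obs for _, obs in sorted(self.entries, key=score, reverse=True)[:top_k]]


def llm_complete(prompt: str) -> str:
    """Stand-in for a GPT-4 call; replace with an actual API request."""
    raise NotImplementedError


def narrate_clip(clip_caption: str, timestamp: float,
                 short_mem: ShortTermTextMemory,
                 long_mem: LongTermVisualMemory) -> str:
    """Generate one AD by recalling both memories, then register the result for later steps."""
    recalled = long_mem.recall(clip_caption)
    prompt = (
        "Recent story context:\n" + short_mem.recall() + "\n\n"
        "Relevant past visual observations:\n" + "\n".join(recalled) + "\n\n"
        "Current clip description:\n" + clip_caption + "\n\n"
        "Write a concise, character-centric audio description for the current clip."
    )
    ad = llm_complete(prompt)
    short_mem.register(ad)
    long_mem.register(timestamp, clip_caption)
    return ad
```

In this reading, each autoregressive step recalls from both memories before prompting the LLM and then registers its own output, which is what lets the narration stay story-coherent over hour-long videos without any fine-tuning.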

Cite

Text

Zhang et al. "MM-Narrator: Narrating Long-Form Videos with Multimodal In-Context Learning." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01295

Markdown

[Zhang et al. "MM-Narrator: Narrating Long-Form Videos with Multimodal In-Context Learning." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/zhang2024cvpr-mmnarrator/) doi:10.1109/CVPR52733.2024.01295

BibTeX

@inproceedings{zhang2024cvpr-mmnarrator,
  title     = {{MM-Narrator: Narrating Long-Form Videos with Multimodal In-Context Learning}},
  author    = {Zhang, Chaoyi and Lin, Kevin and Yang, Zhengyuan and Wang, Jianfeng and Li, Linjie and Lin, Chung-Ching and Liu, Zicheng and Wang, Lijuan},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {13647--13657},
  doi       = {10.1109/CVPR52733.2024.01295},
  url       = {https://mlanthology.org/cvpr/2024/zhang2024cvpr-mmnarrator/}
}