MM-Narrator: Narrating Long-Form Videos with Multimodal In-Context Learning
Abstract
We present MM-Narrator, a novel system leveraging GPT-4 with multimodal in-context learning for the generation of audio descriptions (AD). Unlike previous methods that primarily focus on downstream fine-tuning with short video clips, MM-Narrator excels in generating precise audio descriptions for videos of extensive lengths, even beyond hours, in an autoregressive manner. This capability is made possible by the proposed memory-augmented generation process, which effectively utilizes both the short-term textual context and long-term visual memory through an efficient register-and-recall mechanism. These contextual memories compile pertinent past information, including storylines and character identities, ensuring accurate tracking and depiction of story-coherent and character-centric audio descriptions. Maintaining the training-free design of MM-Narrator, we further propose a complexity-based demonstration selection strategy to largely enhance its multi-step reasoning capability via few-shot multimodal in-context learning (MM-ICL). Experimental results on the MAD-eval dataset demonstrate that MM-Narrator consistently outperforms both existing fine-tuning-based approaches and LLM-based approaches in most scenarios, as measured by standard evaluation metrics. Additionally, we introduce the first segment-based evaluator for recurrent text generation. Empowered by GPT-4, this evaluator comprehensively reasons about and marks AD generation performance in various extendable dimensions.
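To make the abstract's memory-augmented, register-and-recall generation loop concrete, here is a minimal conceptual sketch. It is an assumption-laden illustration, not the authors' released implementation: the names `VisualMemory`, `embed_clip`, `caption_clip`, `call_gpt4`, and `demos` are hypothetical placeholders for the paper's perception and GPT-4 components, and the similarity-based recall is only one plausible realization of the mechanism described.

```python
# Conceptual sketch (hypothetical names) of the memory-augmented AD loop:
# short-term textual context + long-term visual memory with register-and-recall.
from collections import deque

import numpy as np


class VisualMemory:
    """Long-term memory: register per-clip visual features, recall top-k by similarity."""

    def __init__(self):
        self.keys, self.values = [], []

    def register(self, feature: np.ndarray, caption: str) -> None:
        self.keys.append(feature / (np.linalg.norm(feature) + 1e-8))
        self.values.append(caption)

    def recall(self, query: np.ndarray, k: int = 5) -> list[str]:
        if not self.keys:
            return []
        q = query / (np.linalg.norm(query) + 1e-8)
        sims = np.stack(self.keys) @ q          # cosine similarity to all registered clips
        return [self.values[i] for i in np.argsort(-sims)[:k]]


def narrate(clips, embed_clip, caption_clip, call_gpt4, demos):
    """Autoregressively narrate a long video, one clip at a time."""
    short_term = deque(maxlen=10)   # recent ADs (short-term textual context)
    long_term = VisualMemory()      # long-term visual memory
    ads = []
    for clip in clips:
        feat = embed_clip(clip)                    # visual feature for the current clip
        recalled = long_term.recall(feat, k=5)     # recall relevant past information
        prompt = "\n".join([*demos, *recalled, *short_term, caption_clip(clip)])
        ad = call_gpt4(prompt)                     # training-free generation with GPT-4
        ads.append(ad)
        short_term.append(ad)                      # update short-term context
        long_term.register(feat, ad)               # register into long-term memory
    return ads
```

In this sketch, `demos` would hold the few-shot MM-ICL demonstrations (in the paper, chosen by the complexity-based selection strategy); the loop itself simply alternates recall, generation, and registration so that past storylines and character identities remain available for arbitrarily long videos.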
Cite
Text
Zhang et al. "MM-Narrator: Narrating Long-Form Videos with Multimodal In-Context Learning." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01295
Markdown
[Zhang et al. "MM-Narrator: Narrating Long-Form Videos with Multimodal In-Context Learning." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/zhang2024cvpr-mmnarrator/) doi:10.1109/CVPR52733.2024.01295
BibTeX
@inproceedings{zhang2024cvpr-mmnarrator,
title = {{MM-Narrator: Narrating Long-Form Videos with Multimodal In-Context Learning}},
author = {Zhang, Chaoyi and Lin, Kevin and Yang, Zhengyuan and Wang, Jianfeng and Li, Linjie and Lin, Chung-Ching and Liu, Zicheng and Wang, Lijuan},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2024},
pages = {13647-13657},
doi = {10.1109/CVPR52733.2024.01295},
url = {https://mlanthology.org/cvpr/2024/zhang2024cvpr-mmnarrator/}
}