Zero-Shot Video Moment Retrieval via Off-the-Shelf Multimodal Large Language Models

Abstract

Video moment retrieval (VMR) aims to predict temporal spans within a video that semantically match a given linguistic query. Existing VMR methods based on multimodal large language models (MLLMs) rely heavily on expensive high-quality datasets and time-consuming fine-tuning. Although some recent studies introduce a zero-shot setting to avoid fine-tuning, they overlook the inherent language bias in the query, which leads to erroneous localization. To address these challenges, this paper proposes Moment-GPT, a tuning-free pipeline for zero-shot VMR that uses frozen MLLMs. Specifically, we first employ LLaMA-3 to correct and rephrase the query to mitigate language bias. Subsequently, we design a span generator combined with MiniGPT-v2 to produce candidate spans adaptively. Finally, to leverage the video comprehension capabilities of MLLMs, we apply Video-ChatGPT and a span scorer to select the most appropriate spans. Our proposed method substantially outperforms state-of-the-art MLLM-based and zero-shot models on several public datasets, including QVHighlights, ActivityNet-Captions, and Charades-STA.
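The three stages described in the abstract (query rewriting, candidate span generation, span scoring) can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names `generate_spans` and `retrieve_moments`, the callables `rewrite_fn`, `frame_scores_fn`, and `span_score_fn`, and the simple score-thresholding rule are all hypothetical. In the paper those roles are filled by frozen LLaMA-3, MiniGPT-v2, and Video-ChatGPT respectively, whose exact prompts and scoring rules the abstract does not specify.

```python
from typing import Callable, List, Sequence, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)


def generate_spans(frame_scores: Sequence[float], fps: float, threshold: float) -> List[Span]:
    """Group consecutive frames scoring >= threshold into candidate spans.

    Assumption: simple thresholding stands in for the paper's adaptive
    span generator, which the abstract does not detail.
    """
    spans: List[Span] = []
    start = None
    for i, score in enumerate(frame_scores):
        if score >= threshold and start is None:
            start = i  # a relevant run begins at this frame
        elif score < threshold and start is not None:
            spans.append((start / fps, i / fps))  # the run ended before frame i
            start = None
    if start is not None:
        # The run extends to the end of the video.
        spans.append((start / fps, len(frame_scores) / fps))
    return spans


def retrieve_moments(
    query: str,
    rewrite_fn: Callable[[str], str],                # hypothetical stand-in for an LLM query rewriter
    frame_scores_fn: Callable[[str], Sequence[float]],  # hypothetical per-frame query relevance
    span_score_fn: Callable[[str, Span], float],     # hypothetical span-level scorer
    fps: float = 1.0,
    threshold: float = 0.5,
    top_k: int = 1,
) -> List[Span]:
    """Rewrite the query, propose candidate spans, then rank them by score."""
    clean_query = rewrite_fn(query)
    candidates = generate_spans(frame_scores_fn(clean_query), fps, threshold)
    ranked = sorted(candidates, key=lambda sp: span_score_fn(clean_query, sp), reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    # Toy stubs so the sketch runs without any model.
    scores = [0.1, 0.8, 0.9, 0.2, 0.7]
    best = retrieve_moments(
        "  A person opens the door ",
        rewrite_fn=str.strip,
        frame_scores_fn=lambda q: scores,
        span_score_fn=lambda q, sp: sp[1] - sp[0],  # toy rule: prefer longer spans
    )
    print(best)
```

Passing the models in as callables keeps the sketch tuning-free in spirit: each stage only queries a frozen component, and no parameters are updated anywhere in the pipeline.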

Cite

Text

Xu et al. "Zero-Shot Video Moment Retrieval via Off-the-Shelf Multimodal Large Language Models." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I9.32971

Markdown

[Xu et al. "Zero-Shot Video Moment Retrieval via Off-the-Shelf Multimodal Large Language Models." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/xu2025aaai-zero/) doi:10.1609/AAAI.V39I9.32971

BibTeX

@inproceedings{xu2025aaai-zero,
  title     = {{Zero-Shot Video Moment Retrieval via Off-the-Shelf Multimodal Large Language Models}},
  author    = {Xu, Yifang and Sun, Yunzhuo and Zhai, Benxiang and Li, Ming and Liang, Wenxin and Li, Yang and Du, Sidan},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {8978-8986},
  doi       = {10.1609/AAAI.V39I9.32971},
  url       = {https://mlanthology.org/aaai/2025/xu2025aaai-zero/}
}