Zero-Shot Video Moment Retrieval via Off-the-Shelf Multimodal Large Language Models

Xu, Yifang; Sun, Yunzhuo; Zhai, Benxiang; Li, Ming; Liang, Wenxin; Li, Yang; Du, Sidan

doi:10.1609/AAAI.V39I9.32971

AAAI 2025 pp. 8978-8986

Abstract

Video moment retrieval (VMR) aims to predict temporal spans within a video that semantically match a given linguistic query. Existing VMR methods based on multimodal large language models (MLLMs) rely heavily on expensive high-quality datasets and time-consuming fine-tuning. Although some recent studies introduce a zero-shot setting to avoid fine-tuning, they overlook the inherent language bias in the query, leading to erroneous localization. To address these challenges, this paper proposes Moment-GPT, a tuning-free pipeline for zero-shot VMR that uses frozen MLLMs. Specifically, we first employ LLaMA-3 to correct and rephrase the query to mitigate language bias. Next, we design a span generator combined with MiniGPT-v2 to produce candidate spans adaptively. Finally, to leverage the video comprehension capabilities of MLLMs, we apply Video-ChatGPT and a span scorer to select the most appropriate spans. The proposed method substantially outperforms state-of-the-art MLLM-based and zero-shot models on several public datasets, including QVHighlights, ActivityNet-Captions, and Charades-STA.

Cite

Text

Xu et al. "Zero-Shot Video Moment Retrieval via Off-the-Shelf Multimodal Large Language Models." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I9.32971

Markdown

[Xu et al. "Zero-Shot Video Moment Retrieval via Off-the-Shelf Multimodal Large Language Models." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/xu2025aaai-zero/) doi:10.1609/AAAI.V39I9.32971

BibTeX

@inproceedings{xu2025aaai-zero,
  title     = {{Zero-Shot Video Moment Retrieval via Off-the-Shelf Multimodal Large Language Models}},
  author    = {Xu, Yifang and Sun, Yunzhuo and Zhai, Benxiang and Li, Ming and Liang, Wenxin and Li, Yang and Du, Sidan},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {8978-8986},
  doi       = {10.1609/AAAI.V39I9.32971},
  url       = {https://mlanthology.org/aaai/2025/xu2025aaai-zero/}
}