Shot2Story: A New Benchmark for Comprehensive Understanding of Multi-Shot Videos

Abstract

A short video clip may contain a progression of multiple events and an interesting storyline. A human needs to capture both the event in every shot and associate the shots together to understand the story behind the video. In this work, we present Shot2Story, a new multi-shot video understanding benchmark with detailed shot-level captions, comprehensive video summaries, and question-answering pairs. To facilitate better semantic understanding of videos, we provide captions for both visual signals and human narrations. We design several distinct tasks, including single-shot video captioning, multi-shot video summarization, and multi-shot video question answering. Preliminary experiments show that generating a long and comprehensive summary for a multi-shot video remains challenging. Nevertheless, the generated imperfect summaries already achieve competitive performance on existing video understanding tasks such as video question answering, promoting an under-explored setting of video understanding with detailed summaries.

Cite

Text

Han et al. "Shot2Story: A New Benchmark for Comprehensive Understanding of Multi-Shot Videos." International Conference on Learning Representations, 2025.

Markdown

[Han et al. "Shot2Story: A New Benchmark for Comprehensive Understanding of Multi-Shot Videos." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/han2025iclr-shot2story/)

BibTeX

@inproceedings{han2025iclr-shot2story,
  title     = {{Shot2Story: A New Benchmark for Comprehensive Understanding of Multi-Shot Videos}},
  author    = {Han, Mingfei and Yang, Linjie and Chang, Xiaojun and Yao, Lina and Wang, Heng},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/han2025iclr-shot2story/}
}