CinePile: A Long Video Question Answering Dataset and Benchmark

Abstract

Current long-form video understanding datasets often fail to provide genuine comprehension challenges, as many tasks can be solved by analyzing only a few random frames. To address this issue, we present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding. This paper details our innovative approach for creating a question-answer dataset, using advanced LLMs with humans in the loop and building upon human-generated raw data. Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects, including temporal comprehension, understanding human-object interactions, and reasoning about events or actions within a scene. Additionally, we evaluate recent video-centric LLMs, both open-source and proprietary, on the test split of our dataset. The findings reveal that even state-of-the-art video-centric LLMs significantly lag behind human performance on these tasks, highlighting the complexity and challenge inherent in video understanding.
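Since the benchmark is posed as MCQs, evaluation reduces to multiple-choice accuracy over the test split. The sketch below shows one way such scoring could look, assuming the dataset is distributed via the Hugging Face Hub; the repository id `tomg-group-umd/cinepile`, the field names (`question`, `choices`, `answer_key_position`, `video_path`), and the `answer_model` stub are illustrative assumptions, not the paper's confirmed API.

```python
# Minimal sketch of scoring a video LLM on CinePile-style MCQs.
# Assumptions (not confirmed by this page): dataset id, split name,
# and record fields. `answer_model` is a hypothetical stand-in for
# any video-centric LLM that returns the index of its chosen option.
from datasets import load_dataset


def answer_model(video_path: str, question: str, choices: list[str]) -> int:
    """Hypothetical model call; replace with a real video LLM."""
    raise NotImplementedError


def evaluate(split: str = "test") -> float:
    ds = load_dataset("tomg-group-umd/cinepile", split=split)  # assumed repo id
    correct = 0
    for row in ds:
        pred = answer_model(row["video_path"], row["question"], row["choices"])
        correct += int(pred == row["answer_key_position"])  # assumed label field
    return correct / len(ds)  # MCQ accuracy, comparable to human accuracy


if __name__ == "__main__":
    print(f"accuracy: {evaluate():.3f}")
```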

Cite

Text

Rawal et al. "CinePile: A Long Video Question Answering Dataset and Benchmark." NeurIPS 2024 Workshops: Video-Language Models, 2024.

Markdown

[Rawal et al. "CinePile: A Long Video Question Answering Dataset and Benchmark." NeurIPS 2024 Workshops: Video-Language Models, 2024.](https://mlanthology.org/neuripsw/2024/rawal2024neuripsw-cinepile/)

BibTeX

@inproceedings{rawal2024neuripsw-cinepile,
  title     = {{CinePile: A Long Video Question Answering Dataset and Benchmark}},
  author    = {Rawal, Ruchit and Saifullah, Khalid and Basri, Ronen and Jacobs, David and Somepalli, Gowthami and Goldstein, Tom},
  booktitle = {NeurIPS 2024 Workshops: Video-Language Models},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/rawal2024neuripsw-cinepile/}
}