A Video-Grounded Dialogue Dataset and Metric for Event-Driven Activities

Imrattanatrai, Wiradee; Asada, Masaki; Hasegawa, Kimihiro; Cheng, Zhi-Qi; Fukuda, Ken; Mitamura, Teruko

doi:10.1609/AAAI.V39I23.34596

AAAI 2025 pp. 24203-24211

Abstract

This paper presents VDAct, a dataset for Video-grounded Dialogue on Event-driven Activities, alongside VDEval, a session-based context evaluation metric designed specifically for the task. Unlike existing datasets, VDAct includes longer and more complex video sequences that depict a variety of event-driven activities, which require advanced contextual understanding for accurate response generation. The dataset comprises 3,000 dialogues with over 30,000 question-and-answer pairs, derived from 1,000 videos with diverse activity scenarios. VDAct is notably challenging because of its broad spectrum of activity scenarios and wide range of question types. Empirical studies on state-of-the-art vision foundation models highlight their limitations in addressing certain question types on our dataset. Furthermore, VDEval evaluates individual responses by integrating dialogue session history and video content summaries extracted from our supplementary Knowledge Graphs. It demonstrates a significantly higher correlation with human assessments on the VDAct dataset than existing evaluation metrics that rely solely on the context of single dialogue turns.

Cite

Text

Imrattanatrai et al. "A Video-Grounded Dialogue Dataset and Metric for Event-Driven Activities." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I23.34596

Markdown

[Imrattanatrai et al. "A Video-Grounded Dialogue Dataset and Metric for Event-Driven Activities." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/imrattanatrai2025aaai-video/) doi:10.1609/AAAI.V39I23.34596

BibTeX

@inproceedings{imrattanatrai2025aaai-video,
  title     = {{A Video-Grounded Dialogue Dataset and Metric for Event-Driven Activities}},
  author    = {Imrattanatrai, Wiradee and Asada, Masaki and Hasegawa, Kimihiro and Cheng, Zhi-Qi and Fukuda, Ken and Mitamura, Teruko},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {24203--24211},
  doi       = {10.1609/AAAI.V39I23.34596},
  url       = {https://mlanthology.org/aaai/2025/imrattanatrai2025aaai-video/}
}