ANetQA: A Large-Scale Benchmark for Fine-Grained Compositional Reasoning over Untrimmed Videos

Abstract

Building benchmarks to systematically analyze the different capabilities of video question answering (VideoQA) models is challenging yet crucial. Existing benchmarks often use simple, non-compositional questions and suffer from language biases, making it difficult to diagnose model weaknesses incisively. The recent benchmark AGQA introduced a promising paradigm that generates QA pairs automatically from pre-annotated scene graphs, enabling it to measure diverse reasoning abilities with granular control. However, its questions are limited in reasoning about fine-grained semantics in videos, as such information is absent from its scene graphs. To this end, we present ANetQA, a large-scale benchmark that supports fine-grained compositional reasoning over the challenging untrimmed videos from ActivityNet. As in AGQA, the QA pairs in ANetQA are automatically generated from annotated video scene graphs. The fine-grained nature of ANetQA is reflected in: (i) untrimmed videos with fine-grained semantics; (ii) spatio-temporal scene graphs with fine-grained taxonomies; and (iii) diverse questions generated from fine-grained templates. ANetQA comprises 1.4 billion unbalanced and 13.4 million balanced QA pairs, an order of magnitude more than AGQA with a similar number of videos. Comprehensive experiments are performed on state-of-the-art methods. The best model achieves 44.5% accuracy while human performance tops out at 84.5%, leaving sufficient room for improvement.
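To make the template-based generation paradigm described above concrete, here is a minimal Python sketch. The toy scene-graph schema and the question template are our own illustrative assumptions, not the benchmark's actual annotation format or pipeline.

# Minimal sketch (assumptions throughout): instantiating one question
# template over a toy spatio-temporal scene graph, in the spirit of the
# AGQA/ANetQA paradigm. The graph layout and template string below are
# hypothetical.

scene_graph = {
    "objects": {
        "o1": {"category": "person", "attributes": {"pose": "standing"}},
        "o2": {"category": "guitar", "attributes": {"color": "brown"}},
    },
    # (subject id, predicate, object id, start second, end second)
    "relationships": [("o1", "playing", "o2", 3.0, 12.5)],
}

def generate_qa(graph):
    """Turn each annotated relationship into one attribute-query QA pair."""
    qa_pairs = []
    for subj, pred, obj, _start, _end in graph["relationships"]:
        s = graph["objects"][subj]
        o = graph["objects"][obj]
        question = (f"What color is the {o['category']} "
                    f"that the {s['category']} is {pred}?")
        answer = o["attributes"]["color"]
        qa_pairs.append((question, answer))
    return qa_pairs

print(generate_qa(scene_graph))
# [('What color is the guitar that the person is playing?', 'brown')]

The actual benchmark composes many such templates across fine-grained attribute and relation taxonomies, yielding the 1.4 billion unbalanced QA pairs reported above, which are then balanced down to 13.4 million to mitigate answer-distribution biases.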

Cite

Text

Yu et al. "ANetQA: A Large-Scale Benchmark for Fine-Grained Compositional Reasoning over Untrimmed Videos." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.02221

Markdown

[Yu et al. "ANetQA: A Large-Scale Benchmark for Fine-Grained Compositional Reasoning over Untrimmed Videos." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/yu2023cvpr-anetqa/) doi:10.1109/CVPR52729.2023.02221

BibTeX

@inproceedings{yu2023cvpr-anetqa,
  title     = {{ANetQA: A Large-Scale Benchmark for Fine-Grained Compositional Reasoning over Untrimmed Videos}},
  author    = {Yu, Zhou and Zheng, Lixiang and Zhao, Zhou and Wu, Fei and Fan, Jianping and Ren, Kui and Yu, Jun},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {23191--23200},
  doi       = {10.1109/CVPR52729.2023.02221},
  url       = {https://mlanthology.org/cvpr/2023/yu2023cvpr-anetqa/}
}