Just Ask: Learning to Answer Questions from Millions of Narrated Videos

Abstract

Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious and expensive, and prevents scalability. In this work, we propose to avoid manual annotation and to generate a large-scale training dataset for video question answering, making use of automatic cross-modal supervision. We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically generate the HowToVQA69M dataset with 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer transformer. We introduce the zero-shot VideoQA task and show excellent results, in particular for rare answers. Furthermore, we demonstrate that our method significantly outperforms the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA. Finally, for a detailed evaluation, we introduce iVQA, a new VideoQA dataset with reduced language biases and high-quality redundant manual annotations.
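To make the training objective concrete, below is a minimal sketch of a contrastive loss between a video-question embedding and an answer embedding. This is not the authors' code: it assumes an InfoNCE-style objective with in-batch negatives, and the tensors `vq_emb` and `ans_emb` stand in for the outputs of the video-question multi-modal transformer and the answer transformer described in the abstract.

```python
# Hedged sketch: contrastive matching of video-question embeddings to answer
# embeddings, assuming in-batch negatives. Embeddings are placeholders for
# the transformer outputs; this is not the paper's exact formulation.
import torch
import torch.nn.functional as F


def contrastive_loss(vq_emb: torch.Tensor, ans_emb: torch.Tensor) -> torch.Tensor:
    """vq_emb, ans_emb: (B, d) embeddings for B (video, question, answer) triplets.
    Each correct pair (i, i) is scored against all other answers in the batch."""
    scores = vq_emb @ ans_emb.t()                      # (B, B) similarity matrix
    targets = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, targets)            # softmax over candidate answers


# Usage with random vectors standing in for transformer outputs.
B, d = 4, 512
vq = F.normalize(torch.randn(B, d), dim=-1)
ans = F.normalize(torch.randn(B, d), dim=-1)
print(contrastive_loss(vq, ans).item())
```

Because the answer side is an embedding rather than a fixed classification head, this kind of objective supports an open answer vocabulary: at inference time, any candidate answer can be embedded and scored against the video-question representation.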

Cite

Text

Yang et al. "Just Ask: Learning to Answer Questions from Millions of Narrated Videos." International Conference on Computer Vision, 2021. doi:10.1109/ICCV48922.2021.00171

Markdown

[Yang et al. "Just Ask: Learning to Answer Questions from Millions of Narrated Videos." International Conference on Computer Vision, 2021.](https://mlanthology.org/iccv/2021/yang2021iccv-just/) doi:10.1109/ICCV48922.2021.00171

BibTeX

@inproceedings{yang2021iccv-just,
  title     = {{Just Ask: Learning to Answer Questions from Millions of Narrated Videos}},
  author    = {Yang, Antoine and Miech, Antoine and Sivic, Josef and Laptev, Ivan and Schmid, Cordelia},
  booktitle = {International Conference on Computer Vision},
  year      = {2021},
  pages     = {1686--1697},
  doi       = {10.1109/ICCV48922.2021.00171},
  url       = {https://mlanthology.org/iccv/2021/yang2021iccv-just/}
}