Leveraging Video Descriptions to Learn Video Question Answering

Abstract

We propose a scalable approach to learning video-based question answering (QA): answering a free-form natural language question about the content of a video. Our approach automatically harvests a large number of videos and descriptions freely available online. A large number of candidate QA pairs are then generated automatically from the descriptions rather than manually annotated. Next, we use these candidate QA pairs to train several video-based QA methods extended from MN (Sukhbaatar et al. 2015), VQA (Antol et al. 2015), SA (Yao et al. 2015), and SS (Venugopalan et al. 2015). To handle imperfect candidate QA pairs, we propose a self-paced learning procedure that iteratively identifies them and mitigates their effect during training. Finally, we evaluate performance on manually generated video-based QA pairs. The results show that our self-paced learning procedure is effective and that the extended SS model outperforms various baselines.
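The abstract only summarizes the self-paced learning idea. As a rough illustration, the sketch below follows the standard loss-threshold formulation of self-paced learning (train first on low-loss "easy" examples, then gradually admit harder, possibly noisier ones); the compute_losses and model_step callables and all constants are hypothetical placeholders standing in for the paper's video-QA model, not its actual procedure.

import numpy as np

def self_paced_weights(losses, lam):
    # Hard self-paced weights: keep only examples whose current loss is below lam.
    return (losses < lam).astype(np.float32)

def train_self_paced(model_step, compute_losses, data, lam=0.5, growth=1.3, epochs=5):
    # Alternate between (1) selecting candidate QA pairs the current model finds easy
    # and (2) updating the model on that subset, then relax the threshold so harder
    # (and potentially noisier) pairs are admitted in later epochs.
    for epoch in range(epochs):
        losses = compute_losses(data)               # per-example losses under current model
        weights = self_paced_weights(losses, lam)   # suspect pairs get weight 0
        model_step(data, weights)                   # weighted parameter update
        kept = int(weights.sum())
        print(f"epoch {epoch}: kept {kept}/{len(weights)} pairs, lam={lam:.2f}")
        lam *= growth

# Toy usage: synthetic per-example losses stand in for a real video-QA model.
rng = np.random.default_rng(0)
toy_losses = rng.exponential(scale=1.0, size=1000)
train_self_paced(model_step=lambda d, w: None,
                 compute_losses=lambda d: toy_losses,
                 data=None)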

Cite

Text

Zeng et al. "Leveraging Video Descriptions to Learn Video Question Answering." AAAI Conference on Artificial Intelligence, 2017. doi:10.1609/AAAI.V31I1.11238

Markdown

[Zeng et al. "Leveraging Video Descriptions to Learn Video Question Answering." AAAI Conference on Artificial Intelligence, 2017.](https://mlanthology.org/aaai/2017/zeng2017aaai-leveraging/) doi:10.1609/AAAI.V31I1.11238

BibTeX

@inproceedings{zeng2017aaai-leveraging,
  title     = {{Leveraging Video Descriptions to Learn Video Question Answering}},
  author    = {Zeng, Kuo-Hao and Chen, Tseng-Hung and Chuang, Ching-Yao and Liao, Yuan-Hong and Niebles, Juan Carlos and Sun, Min},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2017},
  pages     = {4334-4340},
  doi       = {10.1609/AAAI.V31I1.11238},
  url       = {https://mlanthology.org/aaai/2017/zeng2017aaai-leveraging/}
}