BERT Representations for Video Question Answering

Abstract

Visual question answering (VQA) aims at answering questions about the visual content of an image or a video. Currently, most work on VQA focuses on image-based question answering, and less attention has been paid to answering questions about videos. However, VQA on video presents some unique challenges that are worth studying: it not only requires modelling a sequence of visual features over time, but it often also needs to reason about associated subtitles. In this work, we propose to use BERT, a sequential modelling technique based on Transformers, to encode the complex semantics of video clips. Our proposed model jointly captures the visual and language information of a video scene by encoding not only the subtitles but also a sequence of visual concepts with a pre-trained language-based Transformer. In our experiments, we exhaustively study the performance of our model under different input arrangements, showing outstanding improvements over previous work on two well-known video QA datasets: TVQA and Pororo.
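
The sketch below is a minimal illustration, not the authors' implementation, of the idea described in the abstract: subtitles and detected visual-concept words are packed together with a question and a candidate answer into a single input for a pre-trained BERT. It assumes the Hugging Face `transformers` library and uses made-up example strings; the specific input arrangement and scoring head are assumptions for illustration only.

```python
import torch
from transformers import BertTokenizer, BertModel

# Load a standard pre-trained BERT (not the paper's fine-tuned checkpoint).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Hypothetical example inputs for one QA instance.
question = "What is the man holding when he enters the room?"
candidate_answer = "A cup of coffee."
subtitles = "I brought you some coffee. Thanks, put it on the desk."
visual_concepts = "man door cup desk chair"  # e.g. detected object labels

# One possible input arrangement: question + candidate answer as segment A,
# subtitles + visual concepts as segment B.
inputs = tokenizer(
    question + " " + candidate_answer,
    subtitles + " " + visual_concepts,
    return_tensors="pt",
    truncation=True,
    max_length=256,
)

with torch.no_grad():
    outputs = model(**inputs)

# [CLS] representation summarizing the joint language-visual input; a linear
# layer on top could score this candidate answer against the others.
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # torch.Size([1, 768])
```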

Cite

Text

Yang et al. "BERT Representations for Video Question Answering." Winter Conference on Applications of Computer Vision, 2020.

Markdown

[Yang et al. "BERT Representations for Video Question Answering." Winter Conference on Applications of Computer Vision, 2020.](https://mlanthology.org/wacv/2020/yang2020wacv-bert/)

BibTeX

@inproceedings{yang2020wacv-bert,
  title     = {{BERT Representations for Video Question Answering}},
  author    = {Yang, Zekun and Garcia, Noa and Chu, Chenhui and Otani, Mayu and Nakashima, Yuta and Takemura, Haruo},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2020},
  url       = {https://mlanthology.org/wacv/2020/yang2020wacv-bert/}
}