BERT Representations for Video Question Answering
Abstract
Visual question answering (VQA) aims at answering questions about the visual content of an image or a video. Currently, most work on VQA focuses on image-based question answering, and less attention has been paid to answering questions about videos. However, VQA in video presents some unique challenges that are worth studying: it requires not only modelling a sequence of visual features over time, but often also reasoning about the associated subtitles. In this work, we propose to use BERT, a sequential modelling technique based on Transformers, to encode the complex semantics of video clips. Our proposed model jointly captures the visual and language information of a video scene by encoding not only the subtitles but also a sequence of visual concepts with a pre-trained language-based Transformer. In our experiments, we exhaustively study the performance of our model under different input arrangements, showing outstanding improvements over previous work on two well-known video VQA datasets: TVQA and Pororo.
Cite
Text
Yang et al. "BERT Representations for Video Question Answering." Winter Conference on Applications of Computer Vision, 2020.
Markdown
[Yang et al. "BERT Representations for Video Question Answering." Winter Conference on Applications of Computer Vision, 2020.](https://mlanthology.org/wacv/2020/yang2020wacv-bert/)
BibTeX
@inproceedings{yang2020wacv-bert,
title = {{BERT Representations for Video Question Answering}},
author = {Yang, Zekun and Garcia, Noa and Chu, Chenhui and Otani, Mayu and Nakashima, Yuta and Takemura, Haruo},
booktitle = {Winter Conference on Applications of Computer Vision},
year = {2020},
url = {https://mlanthology.org/wacv/2020/yang2020wacv-bert/}
}