Structured Two-Stream Attention Network for Video Question Answering
Abstract
To date, visual question answering (VQA) (i.e., image QA and video QA) is still a holy grail in vision and language understanding, especially for video QA. Compared with image QA, which focuses primarily on understanding the associations between image region-level details and corresponding questions, video QA requires a model to jointly reason across both the spatial and long-range temporal structures of a video as well as text to provide an accurate answer. In this paper, we specifically tackle the problem of video QA by proposing a Structured Two-stream Attention network, namely STA, to answer a free-form or open-ended natural language question about the content of a given video. First, we infer rich long-range temporal structures in videos using our structured segment component and encode text features. Then, our structured two-stream attention component simultaneously localizes important visual instances, reduces the influence of background video content, and focuses on the relevant text. Finally, the structured two-stream fusion component incorporates the different segments of the query- and video-aware context representations and infers the answers. Experiments on the large-scale video QA dataset TGIF-QA show that our proposed method significantly surpasses the best counterpart (i.e., with one representation for the video input) by 13.0%, 13.5%, and 11.0% on the Action, Trans., and FrameQA tasks, and by 0.3 on the Count task. It also outperforms the best competitor (i.e., with two representations) on the Action, Trans., and FrameQA tasks by 4.1%, 4.7%, and 5.1%.
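The abstract describes a three-step pipeline: segment-level video encoding, two-stream attention between video and question, and fusion for answer inference. The sketch below is a minimal, hypothetical PyTorch illustration of the co-attention step only; the module name TwoStreamAttention, the feature dimensions, and the bilinear affinity formulation are assumptions made for illustration, not the authors' implementation.

# A minimal sketch (not the authors' code) of the two-stream attention idea:
# question features and video segment features attend to each other, and the
# attended representations are fused into a joint context for answering.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamAttention(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.fusion = nn.Linear(2 * dim, dim)

    def forward(self, video, question):
        # video:    (batch, num_segments, dim) segment-level video features
        # question: (batch, num_words, dim) encoded question word features
        q = self.q_proj(question)                   # (B, W, D)
        v = self.v_proj(video)                      # (B, S, D)
        scores = torch.bmm(v, q.transpose(1, 2))    # (B, S, W) affinity

        # Video stream: each segment attends over the question words.
        attn_v2q = F.softmax(scores, dim=2)
        q_ctx = torch.bmm(attn_v2q, question)       # (B, S, D)

        # Text stream: each word attends over the video segments,
        # down-weighting irrelevant background segments.
        attn_q2v = F.softmax(scores, dim=1)
        v_ctx = torch.bmm(attn_q2v.transpose(1, 2), video)  # (B, W, D)

        # Fuse question-aware video context and video-aware question context.
        video_fused = self.fusion(torch.cat([video, q_ctx], dim=-1)).mean(1)
        text_fused = self.fusion(torch.cat([question, v_ctx], dim=-1)).mean(1)
        return video_fused + text_fused             # (B, D) joint context

# Tiny usage example with random features.
if __name__ == "__main__":
    model = TwoStreamAttention(dim=256)
    video = torch.randn(2, 8, 256)       # 2 clips, 8 segments each
    question = torch.randn(2, 12, 256)   # 2 questions, 12 words each
    print(model(video, question).shape)  # torch.Size([2, 256])

In the actual model, the joint context would feed a task-specific answer head (e.g., classification for FrameQA or regression for Count); those details are beyond what the abstract states.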
Cite
Text
Gao et al. "Structured Two-Stream Attention Network for Video Question Answering." AAAI Conference on Artificial Intelligence, 2019. doi:10.1609/AAAI.V33I01.33016391
Markdown
[Gao et al. "Structured Two-Stream Attention Network for Video Question Answering." AAAI Conference on Artificial Intelligence, 2019.](https://mlanthology.org/aaai/2019/gao2019aaai-structured/) doi:10.1609/AAAI.V33I01.33016391
BibTeX
@inproceedings{gao2019aaai-structured,
title = {{Structured Two-Stream Attention Network for Video Question Answering}},
author = {Gao, Lianli and Zeng, Pengpeng and Song, Jingkuan and Li, Yuan-Fang and Liu, Wu and Mei, Tao and Shen, Heng Tao},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2019},
pages = {6391-6398},
doi = {10.1609/AAAI.V33I01.33016391},
url = {https://mlanthology.org/aaai/2019/gao2019aaai-structured/}
}