Multi-Question Learning for Visual Question Answering
Abstract
Visual Question Answering (VQA) poses a great challenge to the computer vision and natural language processing communities. Most existing approaches consider video-question pairs individually during training. However, we observe that a VQA task usually contains multiple questions (either sequentially generated or not) for the target video, and these questions themselves have abundant semantic relations. To exploit these relations, we propose a new paradigm for VQA termed Multi-Question Learning (MQL). Inspired by multi-task learning, MQL learns jointly from multiple questions and their corresponding answers for a target video sequence. The learned representations of video-question pairs are thus more general and can be transferred to new questions. We further propose an effective VQA framework and design a training procedure for MQL, in which a specifically designed attention network models the relation between the input video and the corresponding questions, enabling multiple video-question pairs to be co-trained. Experimental results on public datasets show the favorable performance of the proposed MQL-VQA framework compared to state-of-the-art methods.
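To make the co-training idea concrete, the sketch below (in PyTorch) shows one way a single video encoding could be shared across several questions, with a cross-attention layer relating the video features to each question and a joint loss over all question-answer pairs for the same video. This is a minimal illustrative sketch only; the class name `MQLSketch`, the layer choices, and all dimensions are assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class MQLSketch(nn.Module):
    """Illustrative sketch: one video encoding shared across multiple questions,
    each question attending over the video before answer prediction."""
    def __init__(self, vid_dim=2048, q_dim=300, hid=512, n_answers=1000):
        super().__init__()
        self.video_enc = nn.GRU(vid_dim, hid, batch_first=True)
        self.question_enc = nn.GRU(q_dim, hid, batch_first=True)
        self.attn = nn.MultiheadAttention(hid, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(hid, n_answers)

    def forward(self, video_feats, question_feats_list):
        # video_feats: (1, T, vid_dim); question_feats_list: list of (1, L_i, q_dim)
        v, _ = self.video_enc(video_feats)            # (1, T, hid) frame-level video encoding
        logits = []
        for q_feats in question_feats_list:
            _, q = self.question_enc(q_feats)         # (1, 1, hid) final question state
            q = q.transpose(0, 1)                     # (1, 1, hid), batch-first query
            ctx, _ = self.attn(q, v, v)               # question attends over the video
            logits.append(self.classifier(ctx.squeeze(1)))
        return logits

def mql_loss(logits_list, answer_ids):
    # Joint loss over all questions of one video: the multi-question co-training step.
    ce = nn.CrossEntropyLoss()
    return sum(ce(l, a) for l, a in zip(logits_list, answer_ids)) / len(logits_list)
```

The key point of the sketch is simply that all questions attached to one video contribute to a single joint loss, so the shared video representation is shaped by every question rather than by each video-question pair in isolation.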
Cite
Text
Lei et al. "Multi-Question Learning for Visual Question Answering." AAAI Conference on Artificial Intelligence, 2020. doi:10.1609/AAAI.V34I07.6794
Markdown
[Lei et al. "Multi-Question Learning for Visual Question Answering." AAAI Conference on Artificial Intelligence, 2020.](https://mlanthology.org/aaai/2020/lei2020aaai-multi/) doi:10.1609/AAAI.V34I07.6794
BibTeX
@inproceedings{lei2020aaai-multi,
title = {{Multi-Question Learning for Visual Question Answering}},
author = {Lei, Chenyi and Wu, Lei and Liu, Dong and Li, Zhao and Wang, Guoxin and Tang, Haihong and Li, Houqiang},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2020},
pages = {11328--11335},
doi = {10.1609/AAAI.V34I07.6794},
url = {https://mlanthology.org/aaai/2020/lei2020aaai-multi/}
}