Structured Triplet Learning with POS-Tag Guided Attention for Visual Question Answering
Abstract
Visual question answering (VQA) is of significant interest due to its potential to be a strong test of image understanding systems and to probe the connection between language and vision. Despite much recent progress, general VQA is far from a solved problem. In this paper, we focus on the VQA multiple-choice task, and provide some good practices for designing an effective VQA model that can capture language-vision interactions and perform joint reasoning. We explore mechanisms of incorporating part-of-speech (POS) tag guided attention, convolutional n-grams, triplet attention interactions between the image, question and candidate answer, and structured learning for triplets based on image-question pairs. We evaluate our models on two popular datasets: Visual7W and VQA Real Multiple Choice. Our final model achieves state-of-the-art performance of 68.2% on Visual7W, and a very competitive performance of 69.6% on the test-standard split of VQA Real Multiple Choice.
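
The mechanisms listed in the abstract can be pictured with a small, self-contained sketch. The code below is an illustrative approximation only, not the authors' implementation: it weights question and answer word embeddings with learned per-POS-tag gates, attends over pre-extracted image region features conditioned on the question-answer pair, and produces a single score for the (image, question, answer) triplet. All module names, dimensions, and fusion choices are assumptions made for this sketch.

import torch
import torch.nn as nn
import torch.nn.functional as F

class POSGuidedTripletScorer(nn.Module):
    """Illustrative triplet scorer; sizes and structure are assumptions."""
    def __init__(self, vocab_size, num_pos_tags, embed_dim=300,
                 region_dim=2048, hidden_dim=512):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        # One scalar gate per POS tag, so words with informative tags
        # (e.g. nouns, verbs) can receive larger attention weights.
        self.pos_gate = nn.Embedding(num_pos_tags, 1)
        self.txt_proj = nn.Linear(embed_dim, hidden_dim)
        self.img_proj = nn.Linear(region_dim, hidden_dim)
        self.att = nn.Linear(hidden_dim, 1)
        self.score = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1))

    def encode_text(self, tokens, pos_tags):
        # POS-tag guided pooling: average word embeddings with weights
        # given by a softmax over the per-tag gates.
        emb = self.word_embed(tokens)                    # (B, T, E)
        gate = self.pos_gate(pos_tags).squeeze(-1)       # (B, T)
        w = F.softmax(gate, dim=1).unsqueeze(-1)         # (B, T, 1)
        return self.txt_proj((w * emb).sum(dim=1))       # (B, H)

    def forward(self, regions, q_tok, q_pos, a_tok, a_pos):
        # regions: (B, R, region_dim) pre-extracted image region features.
        q = self.encode_text(q_tok, q_pos)               # (B, H)
        a = self.encode_text(a_tok, a_pos)               # (B, H)
        img = torch.tanh(self.img_proj(regions))         # (B, R, H)
        # Attention over regions conditioned on the question+answer pair.
        query = (q + a).unsqueeze(1)                     # (B, 1, H)
        alpha = F.softmax(self.att(img * query).squeeze(-1), dim=1)
        v = (alpha.unsqueeze(-1) * img).sum(dim=1)       # (B, H)
        # One scalar score per (image, question, answer) triplet.
        return self.score(torch.cat([v, q + a], dim=-1)).squeeze(-1)

At test time the candidate answer with the highest score would be selected; during training, one plausible setup (again an assumption, not necessarily the paper's loss) is a softmax cross-entropy over the scores of all candidate answers sharing the same image-question pair.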
Cite
Text
Wang et al. "Structured Triplet Learning with POS-Tag Guided Attention for Visual Question Answering." IEEE/CVF Winter Conference on Applications of Computer Vision, 2018. doi:10.1109/WACV.2018.00209
Markdown
[Wang et al. "Structured Triplet Learning with POS-Tag Guided Attention for Visual Question Answering." IEEE/CVF Winter Conference on Applications of Computer Vision, 2018.](https://mlanthology.org/wacv/2018/wang2018wacv-structured/) doi:10.1109/WACV.2018.00209
BibTeX
@inproceedings{wang2018wacv-structured,
title = {{Structured Triplet Learning with POS-Tag Guided Attention for Visual Question Answering}},
author = {Wang, Zhe and Liu, Xiaoyi and Wang, Limin and Qiao, Yu and Xie, Xiaohui and Fowlkes, Charless C.},
booktitle = {IEEE/CVF Winter Conference on Applications of Computer Vision},
year = {2018},
pages = {1888-1896},
doi = {10.1109/WACV.2018.00209},
url = {https://mlanthology.org/wacv/2018/wang2018wacv-structured/}
}