Semantically Guided Visual Question Answering

Abstract

We present a novel approach that enhances the challenging task of Visual Question Answering (VQA) by incorporating and enriching semantic knowledge in a VQA model. We first apply Multiple Instance Learning (MIL) to extract a richer visual representation that captures concepts beyond objects, such as actions and colors. Motivated by the observation that semantically related answers often appear together among a model's top predictions, we further develop a new semantically guided loss function for model learning that can drive weakly scored but correct answers toward the top while suppressing wrong answers. We show that these two ideas improve performance in a complementary way, and we demonstrate results competitive with the state of the art on two VQA benchmark datasets.
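To make the second idea concrete, here is a toy sketch (not the paper's exact formulation) of one way a "semantically guided" loss can be realized: the one-hot answer target is smoothed by redistributing a fraction `alpha` of its probability mass to semantically related answers, so that related-but-unlabeled answers are penalized less than unrelated ones. The similarity matrix, vocabulary, and `alpha` below are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def semantically_guided_loss(logits, target_idx, similarity, alpha=0.2):
    """Cross-entropy against a semantically smoothed target (illustrative).

    logits     : (V,) answer scores from a VQA model
    target_idx : index of the ground-truth answer
    similarity : (V, V) semantic similarity matrix in [0, 1]
                 (e.g. cosine similarity of answer word embeddings)
    alpha      : fraction of target mass shared with related answers
    """
    V = logits.shape[0]
    sim = similarity[target_idx].copy()
    sim[target_idx] = 0.0                       # exclude the target itself
    related = sim / sim.sum() if sim.sum() > 0 else np.zeros(V)
    # Smoothed target: keep (1 - alpha) on the ground truth,
    # spread alpha across semantically related answers.
    target = (1 - alpha) * np.eye(V)[target_idx] + alpha * related
    log_probs = np.log(softmax(logits) + 1e-12)
    return -np.sum(target * log_probs)

# Toy vocabulary: answers 0 and 1 are related (e.g. "red", "crimson"),
# answer 2 is unrelated (e.g. "dog").
S = np.array([[1.0, 0.9, 0.0],
              [0.9, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
logits_related = np.array([2.0, 1.5, -1.0])    # residual mass on a related answer
logits_unrelated = np.array([2.0, -1.0, 1.5])  # residual mass on an unrelated answer
```

Under this smoothed target, a prediction that spills probability onto a semantically related answer incurs a lower loss than one that spills the same probability onto an unrelated answer, which is the qualitative behavior the abstract describes.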

Cite

Text

Zhao et al. "Semantically Guided Visual Question Answering." IEEE/CVF Winter Conference on Applications of Computer Vision, 2018. doi:10.1109/WACV.2018.00205

Markdown

[Zhao et al. "Semantically Guided Visual Question Answering." IEEE/CVF Winter Conference on Applications of Computer Vision, 2018.](https://mlanthology.org/wacv/2018/zhao2018wacv-semantically/) doi:10.1109/WACV.2018.00205

BibTeX

@inproceedings{zhao2018wacv-semantically,
  title     = {{Semantically Guided Visual Question Answering}},
  author    = {Zhao, Handong and Fan, Quanfu and Gutfreund, Dan and Fu, Yun},
  booktitle = {IEEE/CVF Winter Conference on Applications of Computer Vision},
  year      = {2018},
  pages     = {1852--1860},
  doi       = {10.1109/WACV.2018.00205},
  url       = {https://mlanthology.org/wacv/2018/zhao2018wacv-semantically/}
}