Multimodal Integration of Human-like Attention in Visual Question Answering

Abstract

Human-like attention as a supervisory signal to guide neural attention has shown significant promise but is currently limited to unimodal integration, even for inherently multimodal tasks such as visual question answering (VQA). We present the Multimodal Human-like Attention Network (MULAN), the first method for multimodal integration of human-like attention on both image and text during training of VQA models. MULAN integrates attention predictions from two state-of-the-art text and image saliency models into the neural self-attention layers of a recent transformer-based VQA model. In evaluations on the challenging VQAv2 dataset, we show that MULAN is competitive with the state of the art in its model class, achieving 73.98% accuracy on test-std and 73.72% on test-dev with approximately 80% fewer trainable parameters than prior work. Overall, our work underlines the potential of integrating multimodal human-like attention into neural attention mechanisms for VQA.
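One generic way to integrate an externally predicted saliency distribution into a self-attention layer is to blend it with the neural attention weights. The sketch below illustrates this idea only; the mixing scheme, the `lam` weight, and the function names are assumptions for illustration, not MULAN's actual integration mechanism.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def saliency_guided_attention(Q, K, V, saliency, lam=0.3):
    """Blend scaled dot-product self-attention with a predicted
    saliency distribution over the keys (illustrative sketch).

    Q, K, V: (n, d) arrays; saliency: (n,) non-negative scores
    from a hypothetical external saliency model.
    lam: mixing weight (a hypothetical hyperparameter).
    """
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d))             # (n, n) neural attention
    prior = saliency / saliency.sum()                # normalized human-like prior
    mixed = (1 - lam) * attn + lam * prior[None, :]  # convex blend per query row
    return mixed @ V                                 # (n, d) attended values

rng = np.random.default_rng(0)
n, d = 5, 8
Q, K, V = rng.normal(size=(3, n, d))
sal = rng.random(n)
out = saliency_guided_attention(Q, K, V, sal)
print(out.shape)  # (5, 8)
```

Because both the softmax weights and the normalized prior are probability distributions over the keys, their convex combination still sums to one per query, so the blended weights remain a valid attention distribution.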

Cite

Text

Sood et al. "Multimodal Integration of Human-like Attention in Visual Question Answering." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023. doi:10.1109/CVPRW59228.2023.00265

Markdown

[Sood et al. "Multimodal Integration of Human-like Attention in Visual Question Answering." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023.](https://mlanthology.org/cvprw/2023/sood2023cvprw-multimodal/) doi:10.1109/CVPRW59228.2023.00265

BibTeX

@inproceedings{sood2023cvprw-multimodal,
  title     = {{Multimodal Integration of Human-like Attention in Visual Question Answering}},
  author    = {Sood, Ekta and Kögel, Fabian and Müller, Philipp and Thomas, Dominike and Bâce, Mihai and Bulling, Andreas},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2023},
  pages     = {2648--2658},
  doi       = {10.1109/CVPRW59228.2023.00265},
  url       = {https://mlanthology.org/cvprw/2023/sood2023cvprw-multimodal/}
}