Predicting Human Scanpaths in Visual Question Answering

Abstract

Attention is an important mechanism for both humans and computer vision systems. While state-of-the-art attention prediction models focus on estimating a static probabilistic saliency map for free-viewing behavior, real-life scenarios are filled with tasks of varying types and complexities, and visual exploration is a temporal process that contributes to task performance. To bridge this gap, we conduct a first study to understand and predict the temporal sequences of eye fixations (a.k.a. scanpaths) while performing general tasks, and examine how scanpaths affect task performance. We present a new deep reinforcement learning method to predict scanpaths that lead to different levels of performance in visual question answering. Conditioned on a task guidance map, the proposed model learns question-specific attention patterns to generate scanpaths. It addresses the exposure bias in scanpath prediction with self-critical sequence training and uses a Consistency-Divergence loss to generate distinguishable scanpaths for correct and incorrect answers. The proposed model not only accurately predicts the spatio-temporal patterns of human behavior in visual question answering, such as fixation position, duration, and order, but also generalizes to free-viewing and visual search tasks, achieving human-level performance in all tasks and significantly outperforming the state of the art.
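For readers unfamiliar with self-critical sequence training, the general idea can be sketched as follows. This is a minimal illustration of the generic technique, not the authors' implementation: the function names, the example probabilities, and the reward values are all hypothetical, and the reward is simply assumed to be some scalar score of a predicted scanpath against human scanpaths.

```python
import math

def sequence_log_prob(step_probs):
    """Sum of log-probabilities the policy assigned to each sampled step."""
    return sum(math.log(p) for p in step_probs)

def scst_loss(sampled_step_probs, sampled_reward, greedy_reward):
    """Self-critical loss for one sampled sequence (hypothetical helper).

    The reward of the greedy rollout acts as the baseline. Minimizing this
    loss raises the log-probability of samples that score better than
    greedy decoding and lowers it for samples that score worse.
    """
    advantage = sampled_reward - greedy_reward
    return -advantage * sequence_log_prob(sampled_step_probs)

# Hypothetical numbers: probabilities of two sampled fixations, plus
# assumed scores for the sampled and greedy scanpaths.
sampled_step_probs = [0.5, 0.25]
loss = scst_loss(sampled_step_probs, sampled_reward=0.8, greedy_reward=0.6)
print(round(loss, 4))
```

Because the model is trained on its own sampled sequences and scored on the whole sequence, it is not restricted to ground-truth prefixes during training, which is how this family of methods reduces exposure bias.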

Cite

Text

Chen et al. "Predicting Human Scanpaths in Visual Question Answering." Conference on Computer Vision and Pattern Recognition, 2021. doi:10.1109/CVPR46437.2021.01073

Markdown

[Chen et al. "Predicting Human Scanpaths in Visual Question Answering." Conference on Computer Vision and Pattern Recognition, 2021.](https://mlanthology.org/cvpr/2021/chen2021cvpr-predicting/) doi:10.1109/CVPR46437.2021.01073

BibTeX

@inproceedings{chen2021cvpr-predicting,
  title     = {{Predicting Human Scanpaths in Visual Question Answering}},
  author    = {Chen, Xianyu and Jiang, Ming and Zhao, Qi},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2021},
  pages     = {10876--10885},
  doi       = {10.1109/CVPR46437.2021.01073},
  url       = {https://mlanthology.org/cvpr/2021/chen2021cvpr-predicting/}
}