Top-Down Visual Attention from Analysis by Synthesis

Abstract

Current attention algorithms (e.g., self-attention) are stimulus-driven and highlight all the salient objects in an image. However, intelligent agents like humans often guide their attention based on the high-level task at hand, focusing only on task-related objects. This ability of task-guided top-down attention provides task-adaptive representations and helps the model generalize to various tasks. In this paper, we consider top-down attention from a classic Analysis-by-Synthesis (AbS) perspective of vision. Prior work indicates a functional equivalence between visual attention and sparse reconstruction; we show that an AbS visual system that optimizes a similar sparse reconstruction objective modulated by a goal-directed top-down signal naturally simulates top-down attention. We further propose Analysis-by-Synthesis Vision Transformer (AbSViT), a top-down-modulated ViT model that variationally approximates AbS and achieves controllable top-down attention. For real-world applications, AbSViT consistently improves over baselines on Vision-Language tasks such as VQA and zero-shot retrieval, where language guides the top-down attention. AbSViT can also serve as a general backbone, improving classification, semantic segmentation, and model robustness. Project page: https://sites.google.com/view/absvit.
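As a rough illustration of the mechanism the abstract describes, the sketch below shows one way a goal-directed top-down signal could modulate standard self-attention: a task prior is added to the value tokens before attention is computed. This is a minimal toy in PyTorch, not the paper's released code; the module name, signature, and the exact placement of the prior are our assumptions, and the full AbSViT additionally couples this feedback, applied across layers, to a variational approximation of the sparse-reconstruction (AbS) objective.

import torch
import torch.nn as nn
from typing import Optional

class TopDownSelfAttention(nn.Module):
    """Toy self-attention whose value tokens are biased by a top-down prior.
    (Hypothetical sketch; not the official AbSViT implementation.)"""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, prior: Optional[torch.Tensor] = None):
        # x: (batch, tokens, dim); prior: (batch, 1, dim), e.g. a language
        # embedding encoding the current task (a VQA question, a retrieval query).
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        if prior is not None:
            # Top-down modulation: the prior biases the value tokens, so the
            # same stimulus yields different, task-adaptive outputs.
            v = v + prior

        def split(t):
            return t.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)

Without a prior, the module reduces to ordinary stimulus-driven self-attention; passing different priors shifts what each token contributes toward task-relevant content. In the paper's model, the top-down feedback spans layers, so subsequent attention maps also change, which is what makes the attention controllable.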

Cite

Text

Shi et al. "Top-Down Visual Attention from Analysis by Synthesis." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.00209

Markdown

[Shi et al. "Top-Down Visual Attention from Analysis by Synthesis." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/shi2023cvpr-topdown/) doi:10.1109/CVPR52729.2023.00209

BibTeX

@inproceedings{shi2023cvpr-topdown,
  title     = {{Top-Down Visual Attention from Analysis by Synthesis}},
  author    = {Shi, Baifeng and Darrell, Trevor and Wang, Xin},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {2102--2112},
  doi       = {10.1109/CVPR52729.2023.00209},
  url       = {https://mlanthology.org/cvpr/2023/shi2023cvpr-topdown/}
}