Bandit Based Attention Mechanism in Vision Transformers

Abstract

Vision Transformers (ViT) have demonstrated remarkable performance on many computer vision tasks. However, their high computational cost and quadratic complexity pose challenges for deployment in resource-constrained environments. The core of a Vision Transformer is the self-attention mechanism, which aggregates information from different image regions, or patches. A conventional ViT attends to all patches, which creates a substantial computational bottleneck and extends training time. We hypothesize that applying soft attention to all patches may be unnecessary, and that focusing instead on the relevant and significant patches (hard attention) would be sufficient. To address this, we introduce a module within the Vision Transformer that allows the attention mechanism to selectively process only the essential patches. We propose a novel bandit-based attention mechanism that leverages the idea of exploration and exploitation. Extensive experiments across various datasets show that the proposed bandit attention-based ViT not only outperforms existing state-of-the-art vision transformer models but also achieves greater throughput and lower computational time during both training and inference. The code is publicly available at https://github.com/aquorio15/bandit wacv
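The abstract does not say which bandit algorithm is used or how rewards are defined, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes an upper-confidence-bound (UCB) rule for choosing patches and assumes the reward of a patch is the average attention weight it receives; the helper names `ucb_select_patches` and `hard_attention`, the constant `c`, and the toy sizes are all hypothetical.

```python
import numpy as np

def ucb_select_patches(mean_reward, pull_counts, step, k, c=1.0):
    """Pick k patch indices by UCB score (hypothetical helper, assumed rule)."""
    # Exploitation: running mean reward of each patch.
    # Exploration: bonus that is large for patches selected rarely or never.
    bonus = c * np.sqrt(np.log(step + 1.0) / (pull_counts + 1e-8))
    return np.argsort(-(mean_reward + bonus))[:k]

def hard_attention(q, keys, values, keep):
    """Scaled dot-product attention restricted to the kept key/value patches."""
    k_sel, v_sel = keys[keep], values[keep]
    logits = q @ k_sel.T / np.sqrt(q.shape[-1])
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v_sel, weights

# Toy setup: 16 patches, 8-dim embeddings, keep 4 patches per step.
rng = np.random.default_rng(0)
n, d, k = 16, 8, 4
q, keys, values = rng.normal(size=(3, n, d))

mean_reward = np.zeros(n)   # running mean reward per patch
pull_counts = np.zeros(n)   # how often each patch was selected

for step in range(1, 6):
    keep = ucb_select_patches(mean_reward, pull_counts, step, k)
    out, weights = hard_attention(q, keys, values, keep)
    # Assumed reward: mean attention mass each kept patch receives.
    reward = weights.mean(axis=0)
    pull_counts[keep] += 1
    mean_reward[keep] += (reward - mean_reward[keep]) / pull_counts[keep]
```

In this sketch the attention score matrix has shape `(n, k)` rather than `(n, n)`, which is where a saving over full soft attention would come from. Patches that have never been selected receive a very large exploration bonus, so each one is tried before the running rewards start to drive the choice.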

Cite

Text

Chowdhury et al. "Bandit Based Attention Mechanism in Vision Transformers." Winter Conference on Applications of Computer Vision, 2025.

Markdown

[Chowdhury et al. "Bandit Based Attention Mechanism in Vision Transformers." Winter Conference on Applications of Computer Vision, 2025.](https://mlanthology.org/wacv/2025/chowdhury2025wacv-bandit/)

BibTeX

@inproceedings{chowdhury2025wacv-bandit,
  title     = {{Bandit Based Attention Mechanism in Vision Transformers}},
  author    = {Chowdhury, Amartya Roy and Diddigi, Raghuram Bharadwaj and Prabuchandran, K J and Tripathi, Achyut Mani},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2025},
  pages     = {9579--9588},
  url       = {https://mlanthology.org/wacv/2025/chowdhury2025wacv-bandit/}
}