Dual-Windowed Vision Transformer with Angular Self-Attention

Abstract

Following their great success in natural language processing, transformer-based models have emerged as competitive alternatives to convolutional neural networks in computer vision. The Vision Transformer (ViT) and its subsequent variants have exhibited promising performance on tasks such as image classification, object detection, and semantic segmentation. The core of vision transformers is the self-attention mechanism, which models long-range dependencies between tokens. Conventionally, the attention matrix in self-attention is calculated as the scaled dot-product of *query* (Q) and *key* (K); the attention weight therefore depends on the norms of Q and K as well as the angle between them. In this paper, we propose a new attention mechanism named angular self-attention, which replaces the scaled dot-product operation with an angular function in order to more effectively model the relationships between tokens. In particular, we propose two forms of this function, quadratic and cosine, for angular self-attention. Based on angular self-attention, we design a new vision transformer architecture called the dual-windowed angular vision transformer (**DWAViT**). DWAViT is a hierarchically structured model characterized by angular self-attention and a new local window mechanism. We evaluate DWAViT on multiple computer vision benchmarks, including image classification on ImageNet-1K, object detection on COCO, and semantic segmentation on ADE20K. Our experimental results suggest that DWAViT achieves promising performance on these tasks while maintaining computational cost comparable to that of baseline models such as the Swin Transformer.
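The abstract's core observation is that dot-product attention weights mix two factors, the norms of Q and K and the angle between them, and that angular self-attention keeps only the angular part. The paper's exact quadratic and cosine formulations are not reproduced here; as a rough illustrative sketch (not the authors' implementation), a cosine-based variant that makes the similarity depend only on the angle between token vectors might look like:

```python
import numpy as np

def cosine_angular_attention(Q, K, V):
    """Illustrative angle-only attention (hypothetical sketch, not the
    paper's exact formulation).

    Q: (n_q, d) queries, K: (n_k, d) keys, V: (n_k, d) values.
    Returns the attended output and the attention weight matrix.
    """
    # L2-normalize queries and keys so the similarity depends only on
    # the angle between token vectors, not on their norms.
    Qn = Q / np.linalg.norm(Q, axis=-1, keepdims=True)
    Kn = K / np.linalg.norm(K, axis=-1, keepdims=True)
    # Cosine of the angle between each query/key pair, in [-1, 1].
    cos_sim = Qn @ Kn.T
    # Row-wise softmax over keys to obtain attention weights.
    weights = np.exp(cos_sim)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

Unlike raw scaled dot-product scores, which grow with the token norms, the cosine scores here are bounded in [-1, 1], so two tokens pointing in the same direction receive the same weight regardless of their magnitudes.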

Cite

Text

Shi and Li. "Dual-Windowed Vision Transformer with Angular Self-Attention." Transactions on Machine Learning Research, 2024.

Markdown

[Shi and Li. "Dual-Windowed Vision Transformer with Angular Self-Attention." Transactions on Machine Learning Research, 2024.](https://mlanthology.org/tmlr/2024/shi2024tmlr-dualwindowed/)

BibTeX

@article{shi2024tmlr-dualwindowed,
  title     = {{Dual-Windowed Vision Transformer with Angular Self-Attention}},
  author    = {Shi, Weili and Li, Sheng},
  journal   = {Transactions on Machine Learning Research},
  year      = {2024},
  url       = {https://mlanthology.org/tmlr/2024/shi2024tmlr-dualwindowed/}
}