Always Skip Attention

Abstract

We highlight a curious empirical result within modern Vision Transformers (ViTs). Specifically, self-attention catastrophically fails to train unless it is used in conjunction with a skip connection. This is in contrast to other elements of a ViT that continue to exhibit good (albeit suboptimal) performance when skip connections are removed. Further, we show that this critical dependence on skip connections is a relatively new phenomenon, with previous deep architectures (e.g., CNNs) exhibiting good performance in their absence. In this paper, we theoretically characterize the self-attention mechanism as fundamentally ill-conditioned and therefore uniquely dependent on skip connections for regularization. Additionally, we propose Token Graying (TG), a simple yet effective complement to skip connections that further improves the conditioning of input tokens. We validate our approach in both supervised and self-supervised training settings.
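To make the contrast concrete, here is a minimal PyTorch sketch (not the authors' code; module and variable names are illustrative assumptions) of a pre-norm ViT-style attention block whose skip connection can be toggled, together with a crude condition-number probe of the output token matrix:

```python
# Minimal sketch: a pre-norm self-attention block with an optional skip
# connection, plus a rough conditioning check on the output tokens.
import torch
import torch.nn as nn


class AttentionBlock(nn.Module):
    def __init__(self, dim: int = 64, num_heads: int = 4, use_skip: bool = True):
        super().__init__()
        self.use_skip = use_skip
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-norm self-attention, as in standard ViT blocks.
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        # The skip connection is the only difference between the two settings
        # contrasted in the abstract above.
        return x + out if self.use_skip else out


if __name__ == "__main__":
    torch.manual_seed(0)
    tokens = torch.randn(1, 197, 64)  # (batch, tokens, dim), ViT-style patch tokens

    for use_skip in (True, False):
        block = AttentionBlock(use_skip=use_skip)
        with torch.no_grad():
            y = block(tokens)
        # Ratio of largest to smallest singular value of the (tokens x dim)
        # output matrix, as a rough proxy for conditioning.
        s = torch.linalg.svdvals(y[0])
        print(f"use_skip={use_skip}: output condition number ~ {(s.max() / s.min()).item():.1f}")
```

This is only a toy probe of a single randomly initialized block; the paper's analysis concerns how such conditioning behaves during training of full ViTs.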

Cite

Text

Ji et al. "Always Skip Attention." International Conference on Computer Vision, 2025.

Markdown

[Ji et al. "Always Skip Attention." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/ji2025iccv-always/)

BibTeX

@inproceedings{ji2025iccv-always,
  title     = {{Always Skip Attention}},
  author    = {Ji, Yiping and Saratchandran, Hemanth and Moghadam, Peyman and Lucey, Simon},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {23115--23123},
  url       = {https://mlanthology.org/iccv/2025/ji2025iccv-always/}
}