A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity

Hongkang Li, Meng Wang, Sijia Liu, Pin-Yu Chen

ICLR 2023


Abstract

Vision Transformers (ViTs) with self-attention modules have recently achieved great empirical success in many vision tasks. Due to non-convex interactions across layers, however, theoretical analysis of their learning and generalization remains mostly elusive. Based on a data model characterizing both label-relevant and label-irrelevant tokens, this paper provides the first theoretical analysis of training a three-layer ViT, i.e., one self-attention layer followed by a two-layer perceptron, for a classification task. We characterize the sample complexity to achieve zero generalization error. Our sample complexity bound is positively correlated with the inverse of the fraction of label-relevant tokens, the token noise level, and the initial model error. We also prove that a training process using stochastic gradient descent (SGD) leads to a sparse attention map, which is a formal verification of the general intuition about the success of attention. Moreover, this paper indicates that proper token sparsification can improve test performance by removing label-irrelevant and/or noisy tokens, including spurious correlations. Empirical experiments on synthetic data and the CIFAR-10 dataset justify our theoretical results and generalize to deeper ViTs.
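
As a rough illustration of the architecture the abstract names (one self-attention layer followed by a two-layer perceptron), the following is a minimal pure-Python sketch of a forward pass. It is not the authors' code: the single-head form, the ReLU activation, the averaging over tokens, the dimensions, the random initialization, and all names (`shallow_vit`, `attention_weights`, `W_q`, `W_k`, `W_v`, `W_o`, `a`) are assumptions made for illustration, since this page does not give the paper's exact parameterization.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, vec) for row in matrix]


def attention_weights(tokens, W_q, W_k):
    """Row l holds the softmax attention of query token l over all tokens."""
    queries = [matvec(W_q, t) for t in tokens]
    keys = [matvec(W_k, t) for t in tokens]
    return [softmax([dot(q, k) for k in keys]) for q in queries]


def shallow_vit(tokens, W_q, W_k, W_v, W_o, a):
    """Self-attention, then a two-layer ReLU perceptron, averaged over tokens."""
    values = [matvec(W_v, t) for t in tokens]
    dim = len(values[0])
    total = 0.0
    for row in attention_weights(tokens, W_q, W_k):
        # Attention-weighted sum of value vectors for this query token.
        mixed = [sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
        # Hidden layer with ReLU, followed by a linear output layer.
        hidden = [max(0.0, h) for h in matvec(W_o, mixed)]
        total += dot(a, hidden)
    return total / len(tokens)


def rand_matrix(rows, cols, scale=0.1):
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: 5 tokens of dimension 4, 3 hidden neurons.
random.seed(0)
num_tokens, d, m = 5, 4, 3
tokens = rand_matrix(num_tokens, d, scale=1.0)
W_q, W_k, W_v = rand_matrix(d, d), rand_matrix(d, d), rand_matrix(d, d)
W_o = rand_matrix(m, d)
a = [random.choice([-1.0, 1.0]) / m for _ in range(m)]  # assumed output weights

score = shallow_vit(tokens, W_q, W_k, W_v, W_o, a)
label = 1 if score > 0 else -1  # binary prediction from the sign of the score
```

In this sketch, each row of `attention_weights(...)` is a probability distribution over tokens. A sparse attention map, in the sense the abstract describes, would correspond to rows whose mass concentrates on a few tokens; token sparsification would correspond to dropping entries from `tokens` before the forward pass.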


Cite

Text

Li et al. "A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity." International Conference on Learning Representations, 2023.

Markdown

[Li et al. "A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity." International Conference on Learning Representations, 2023.](https://mlanthology.org/iclr/2023/li2023iclr-theoretical/)

BibTeX

@inproceedings{li2023iclr-theoretical,
  title     = {{A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity}},
  author    = {Li, Hongkang and Wang, Meng and Liu, Sijia and Chen, Pin-Yu},
  booktitle = {International Conference on Learning Representations},
  year      = {2023},
  url       = {https://mlanthology.org/iclr/2023/li2023iclr-theoretical/}
}