Transformers as Support Vector Machines
Abstract
The transformer architecture has led to revolutionary advancements in NLP. The attention layer within the transformer accepts a sequence of input tokens $X$ and makes them interact through pairwise similarities computed as $\texttt{softmax}(XQK^\top X^\top)$, where $(K,Q)$ are the trainable key-query parameters. In this work, we establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem that separates optimal input tokens from non-optimal tokens using linear constraints on the outer products of token pairs. This formalism allows us to characterize the implicit bias of 1-layer transformers optimized with gradient descent: (1) Optimizing the attention layer, parameterized by $(K,Q)$, with vanishing regularization, converges in direction to an SVM solution minimizing the nuclear norm of the combined parameter $W:=KQ^\top$. In contrast, parameterizing directly in terms of $W$ minimizes a Frobenius-norm SVM objective. (2) Complementing this, for the $W$-parameterization, we prove the local/global directional convergence of gradient descent under suitable geometric conditions, and propose a more general SVM equivalence that predicts the implicit bias of attention with nonlinear heads/MLPs.
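To make the abstract's notation concrete, the following is a minimal NumPy sketch (variable names and dimensions are illustrative and not taken from the paper) of the attention similarities $\texttt{softmax}(XQK^\top X^\top)$ and of the two norms referenced by the SVM characterizations: the nuclear norm of the combined parameter $W = KQ^\top$ versus its Frobenius norm.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Row-wise softmax with the usual max-subtraction for numerical stability.
    scores = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=axis, keepdims=True)

# Illustrative shapes: T tokens of dimension d, head dimension m (chosen arbitrarily).
rng = np.random.default_rng(0)
T, d, m = 5, 8, 4
X = rng.normal(size=(T, d))   # input token sequence
K = rng.normal(size=(d, m))   # trainable key parameters
Q = rng.normal(size=(d, m))   # trainable query parameters

# Pairwise similarities softmax(X Q K^T X^T), applied row-wise over tokens.
A = softmax(X @ Q @ K.T @ X.T, axis=-1)          # (T, T) attention map

# Combined parameter W := K Q^T and the two norms appearing in the SVM objectives:
W = K @ Q.T
nuclear_norm   = np.linalg.norm(W, ord='nuc')    # implicit bias of the (K, Q)-parameterization
frobenius_norm = np.linalg.norm(W, ord='fro')    # implicit bias of the direct W-parameterization
print(A.shape, nuclear_norm, frobenius_norm)
```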
Cite
Text
Tarzanagh et al. "Transformers as Support Vector Machines." NeurIPS 2023 Workshops: M3L, 2023.

Markdown
[Tarzanagh et al. "Transformers as Support Vector Machines." NeurIPS 2023 Workshops: M3L, 2023.](https://mlanthology.org/neuripsw/2023/tarzanagh2023neuripsw-transformers/)

BibTeX
@inproceedings{tarzanagh2023neuripsw-transformers,
title = {{Transformers as Support Vector Machines}},
author = {Tarzanagh, Davoud Ataee and Li, Yingcong and Thrampoulidis, Christos and Oymak, Samet},
booktitle = {NeurIPS 2023 Workshops: M3L},
year = {2023},
url = {https://mlanthology.org/neuripsw/2023/tarzanagh2023neuripsw-transformers/}
}