Optimizing Attention with Mirror Descent: Generalized Max-Margin Token Selection

Julistiono, Aaron Alvarado Kristanto; Tarzanagh, Davoud Ataee; Azizan, Navid

Optimizing Attention with Mirror Descent: Generalized Max-Margin Token Selection

Aaron Alvarado Kristanto Julistiono, Davoud Ataee Tarzanagh, Navid Azizan

NeurIPSW 2024

/neuripsw/2024/julistiono2024neuripsw-optimizing/

Abstract

Attention mechanisms have revolutionized numerous domains of artificial intelligence, including natural language processing and computer vision, by enabling models to selectively focus on relevant parts of the input data. Building on recent results characterizing the optimization dynamics of gradient descent (GD) and the structural properties of its preferred solutions in attention-based models, this paper explores the convergence properties and implicit bias of a family of mirror descent (MD) algorithms designed for softmax attention mechanisms, with the potential function chosen as the $p$-th power of the $\ell_p$-norm. Specifically, we show the directional convergence of these algorithms to a generalized hard-margin SVM with an $\ell_p$-norm objective when applied to a classification problem using a one-layer softmax attention model. Our theoretical results demonstrate that these algorithms not only converge directionally to the generalized max-margin solutions but also do so at a rate comparable to that of traditional GD in simpler models, despite the highly nonlinear and nonconvex nature of the present problem. Additionally, we delve into the joint optimization dynamics of the key-query matrix and the decoder, establishing conditions under which this complex joint optimization converges to their respective hard-margin SVM solutions.

PDF NeurIPSW OpenReview Semantic Scholar

Cite

Text

Julistiono et al. "Optimizing Attention with Mirror Descent: Generalized Max-Margin Token Selection." NeurIPS 2024 Workshops: M3L, 2024.

Markdown

[Julistiono et al. "Optimizing Attention with Mirror Descent: Generalized Max-Margin Token Selection." NeurIPS 2024 Workshops: M3L, 2024.](https://mlanthology.org/neuripsw/2024/julistiono2024neuripsw-optimizing/)

BibTeX

@inproceedings{julistiono2024neuripsw-optimizing,
  title     = {{Optimizing Attention with Mirror Descent: Generalized Max-Margin Token Selection}},
  author    = {Julistiono, Aaron Alvarado Kristanto and Tarzanagh, Davoud Ataee and Azizan, Navid},
  booktitle = {NeurIPS 2024 Workshops: M3L},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/julistiono2024neuripsw-optimizing/}
}