Attention Mechanism, Max-Affine Partition, and Universal Approximation

Abstract

We establish the universal approximation capability of single-layer, single-head self- and cross-attention mechanisms with minimal attached structures. Our key insight is to interpret single-head attention as an input domain-partition mechanism that assigns distinct values to subregions. This allows us to engineer the attention weights so that this assignment imitates the target function. Building on this, we prove that a single self-attention layer, preceded by sum-of-linear transformations, is capable of approximating any continuous function on a compact domain under the $L_\infty$-norm. Furthermore, we extend this construction to approximate any Lebesgue-integrable function under the $L_p$-norm for $1\leq p <\infty$. Lastly, we extend our techniques to show that, for the first time, single-head cross-attention achieves the same universal approximation guarantees.
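As a rough illustration of the domain-partition view described in the abstract (a minimal sketch, not the paper's actual construction), the NumPy snippet below treats single-query softmax attention over affine scores as a softened max-affine partition: as the inverse temperature beta grows, the attention weights concentrate on the affine piece with the largest score, so the output approaches a piecewise-constant assignment of values to the cells of the partition. The function names, slopes, offsets, and values are illustrative assumptions chosen for a toy example on [0, 1].

import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D score vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_partition(x, A, b, v, beta=50.0):
    # Scores s_i = beta * (A[i] @ x + b[i]) play the role of scaled
    # query-key inner products; v[i] is the value assigned to cell i.
    # As beta grows, softmax concentrates on argmax_i (A[i] @ x + b[i]),
    # so the output approaches a piecewise-constant function whose
    # pieces are the cells of the max-affine partition.
    scores = beta * (A @ x + b)
    return softmax(scores) @ v

# Toy example: three affine scores partition [0, 1] at x = 0.25 and x = 0.75,
# and each cell is assigned a distinct value approximating some target.
A = np.array([[-4.0], [0.0], [4.0]])   # slopes of the affine scores
b = np.array([1.0, 0.0, -3.0])         # offsets setting the cell boundaries
v = np.array([0.2, 0.9, 0.4])          # value assigned to each cell

for x in [0.1, 0.5, 0.9]:
    print(x, attention_partition(np.array([x]), A, b, v))
    # prints approximately 0.2, 0.9, 0.4: one value per cell of the partition

The sketch is one query attending to a fixed set of keys and values; the paper's results concern full single-head self- and cross-attention layers with the stated attached structures, which this toy example does not reproduce.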

Cite

Text

Liu et al. "Attention Mechanism, Max-Affine Partition, and Universal Approximation." Advances in Neural Information Processing Systems, 2025.

Markdown

[Liu et al. "Attention Mechanism, Max-Affine Partition, and Universal Approximation." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/liu2025neurips-attention/)

BibTeX

@inproceedings{liu2025neurips-attention,
  title     = {{Attention Mechanism, Max-Affine Partition, and Universal Approximation}},
  author    = {Liu, Hude and Hu, Jerry Yao-Chieh and Song, Zhao and Liu, Han},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/liu2025neurips-attention/}
}