On the Optimization and Generalization of Multi-Head Attention
Abstract
The training and generalization dynamics of the Transformer's core mechanism, namely the Attention mechanism, remain under-explored. Moreover, existing analyses primarily focus on single-head attention. Inspired by the demonstrated benefits of overparameterization when training fully-connected networks, we investigate the potential optimization and generalization advantages of using multiple attention heads. Towards this goal, we derive convergence and generalization guarantees for gradient-descent training of a single-layer multi-head self-attention model, under a suitable realizability condition on the data. We then establish primitive conditions on the initialization that ensure realizability holds. Finally, we demonstrate that these conditions are satisfied for a simple tokenized-mixture model. We expect that the analysis can be extended to various data models and architectural variants.
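For concreteness, the sketch below shows one plausible instantiation of the kind of model the abstract refers to: a single-layer multi-head self-attention classifier trained with plain gradient descent on the logistic loss. The pooling scheme (average over heads, then over tokens), the linear readout, and all hyperparameters here are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch (assumed setup, not the paper's exact model): a single-layer
# multi-head self-attention classifier trained with full-batch gradient descent.
import torch
import torch.nn as nn

class SingleLayerMultiHeadAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        # One query/key/value projection per head.
        self.heads = nn.ModuleList(
            nn.ModuleDict({
                "query": nn.Linear(dim, dim, bias=False),
                "key": nn.Linear(dim, dim, bias=False),
                "value": nn.Linear(dim, dim, bias=False),
            })
            for _ in range(num_heads)
        )
        # Linear readout producing a scalar logit per example.
        self.readout = nn.Linear(dim, 1, bias=False)

    def forward(self, x):  # x: (batch, tokens, dim)
        head_outputs = []
        for h in self.heads:
            q, k, v = h["query"](x), h["key"](x), h["value"](x)
            attn = torch.softmax(q @ k.transpose(-2, -1) / x.shape[-1] ** 0.5, dim=-1)
            head_outputs.append(attn @ v)  # (batch, tokens, dim)
        # Average over heads, then mean-pool over tokens (an assumed pooling choice).
        pooled = torch.stack(head_outputs).mean(dim=0).mean(dim=1)
        return self.readout(pooled).squeeze(-1)  # scalar logit per example

# Full-batch gradient descent on the logistic (binary cross-entropy) loss.
model = SingleLayerMultiHeadAttention(dim=16, num_heads=4)
x, y = torch.randn(32, 10, 16), torch.randint(0, 2, (32,)).float()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.binary_cross_entropy_with_logits(model(x), y)
    loss.backward()
    opt.step()
```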
Cite
Text
Deora et al. "On the Optimization and Generalization of Multi-Head Attention." Transactions on Machine Learning Research, 2024.
Markdown
[Deora et al. "On the Optimization and Generalization of Multi-Head Attention." Transactions on Machine Learning Research, 2024.](https://mlanthology.org/tmlr/2024/deora2024tmlr-optimization/)
BibTeX
@article{deora2024tmlr-optimization,
title = {{On the Optimization and Generalization of Multi-Head Attention}},
author = {Deora, Puneesh and Ghaderi, Rouzbeh and Taheri, Hossein and Thrampoulidis, Christos},
journal = {Transactions on Machine Learning Research},
year = {2024},
url = {https://mlanthology.org/tmlr/2024/deora2024tmlr-optimization/}
}