How Transformers Utilize Multi-Head Attention in In-Context Learning? A Case Study on Sparse Linear Regression

Abstract

In this study, we investigate how a trained multi-head transformer performs in-context learning on sparse linear regression. We experimentally discover distinct patterns in multi-head utilization across layers: multiple heads are essential in the first layer, while subsequent layers predominantly rely on a single head. We propose that the first layer preprocesses the input data, while later layers execute simple optimization steps on the preprocessed data. Theoretically, we prove that such a preprocess-then-optimize algorithm can outperform naive gradient descent and ridge regression, which is corroborated by our experiments. Our findings provide insights into the benefits of multi-head attention and the intricate mechanisms within trained transformers.
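For concreteness, below is a minimal sketch (not the paper's construction) of the in-context sparse linear regression setting the abstract refers to, together with the two classical baselines it mentions, naive gradient descent and ridge regression. The dimension, sparsity level, context length, step size, and regularization strength are illustrative assumptions, not values from the paper.

# Illustrative sketch of the in-context sparse linear regression task:
# a sparse ground-truth weight vector, a small set of in-context examples,
# and a query point, evaluated with two classical baselines.
import numpy as np

rng = np.random.default_rng(0)
d, s, n, noise = 20, 3, 40, 0.1  # dimension, sparsity, context length, noise std (assumed values)

# Ground-truth sparse weights and in-context examples (X, y), plus a query x_q.
w_star = np.zeros(d)
w_star[rng.choice(d, size=s, replace=False)] = rng.normal(size=s)
X = rng.normal(size=(n, d))
y = X @ w_star + noise * rng.normal(size=n)
x_q = rng.normal(size=d)

# Baseline 1: a few steps of naive gradient descent on the least-squares loss.
def gd_predict(X, y, x_q, steps=10, lr=0.01):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)  # gradient of 0.5 * mean((Xw - y)^2)
    return x_q @ w

# Baseline 2: ridge regression with a fixed regularization strength.
def ridge_predict(X, y, x_q, lam=1.0):
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return x_q @ w

print("true value:", x_q @ w_star)
print("naive GD  :", gd_predict(X, y, x_q))
print("ridge     :", ridge_predict(X, y, x_q))

The paper's claim is that a trained transformer can implement a preprocess-then-optimize procedure that beats both baselines above; the sketch only sets up the task and the reference predictors for comparison.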

Cite

Text

Chen et al. "How Transformers Utilize Multi-Head Attention in In-Context Learning? A Case Study on Sparse Linear Regression." ICML 2024 Workshops: TF2M, 2024.

Markdown

[Chen et al. "How Transformers Utilize Multi-Head Attention in In-Context Learning? A Case Study on Sparse Linear Regression." ICML 2024 Workshops: TF2M, 2024.](https://mlanthology.org/icmlw/2024/chen2024icmlw-transformers/)

BibTeX

@inproceedings{chen2024icmlw-transformers,
  title     = {{How Transformers Utilize Multi-Head Attention in In-Context Learning? A Case Study on Sparse Linear Regression}},
  author    = {Chen, Xingwu and Zhao, Lei and Zou, Difan},
  booktitle = {ICML 2024 Workshops: TF2M},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/chen2024icmlw-transformers/}
}