How Transformers Utilize Multi-Head Attention in In-Context Learning? a Case Study on Sparse Linear Regression
Abstract
In this study, we investigate how a trained multi-head transformer performs in-context learning on sparse linear regression. We experimentally discover distinct patterns in multi-head utilization across layers: multiple heads are essential in the first layer, while subsequent layers predominantly rely on a single head. We propose that the first layer preprocesses the input data, while the later layers execute simple optimization steps on the preprocessed data. Theoretically, we prove that such a preprocess-then-optimize algorithm can outperform naive gradient descent and ridge regression, a result further corroborated by experiments. Our findings provide insights into the benefits of multi-head attention and the intricate mechanisms within trained transformers.
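To make the preprocess-then-optimize idea concrete, here is a minimal numerical sketch. It is not the paper's trained-transformer construction: the preprocessing step below is a hypothetical whitening-style linear map chosen only to illustrate how transforming the context examples before a few gradient-descent steps can change the quality of the recovered weights on a sparse linear regression instance. All names (`gd`, `P`, the problem sizes) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 20, 15, 3            # dimension, context length, sparsity

# Sparse ground-truth weight vector and in-context examples
w_star = np.zeros(d)
w_star[rng.choice(d, size=k, replace=False)] = rng.normal(size=k)
X = rng.normal(size=(n, d))
y = X @ w_star

def gd(X, y, steps=10, lr=0.1):
    """Plain gradient descent on the least-squares loss."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w

# Hypothetical preprocessing (NOT the paper's learned map): approximately
# whiten the covariates so the subsequent GD steps act like preconditioned
# updates, then map the solution back to the original coordinates.
cov = X.T @ X / n + 1e-3 * np.eye(d)
P = np.linalg.inv(np.linalg.cholesky(cov)).T   # X @ P has ~identity covariance
w_pre = P @ gd(X @ P, y)

w_gd = gd(X, y)
print("GD on raw data     :", np.linalg.norm(w_gd - w_star))
print("preprocess-then-GD :", np.linalg.norm(w_pre - w_star))
```

The printed errors are for illustration only; the paper's claim concerns the preprocessing implicitly implemented by the first attention layer of a trained transformer, not this particular whitening transform.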
Cite
Text
Chen et al. "How Transformers Utilize Multi-Head Attention in In-Context Learning? a Case Study on Sparse Linear Regression." ICML 2024 Workshops: TF2M, 2024.
Markdown
[Chen et al. "How Transformers Utilize Multi-Head Attention in In-Context Learning? a Case Study on Sparse Linear Regression." ICML 2024 Workshops: TF2M, 2024.](https://mlanthology.org/icmlw/2024/chen2024icmlw-transformers/)
BibTeX
@inproceedings{chen2024icmlw-transformers,
title = {{How Transformers Utilize Multi-Head Attention in In-Context Learning? a Case Study on Sparse Linear Regression}},
author = {Chen, Xingwu and Zhao, Lei and Zou, Difan},
booktitle = {ICML 2024 Workshops: TF2M},
year = {2024},
url = {https://mlanthology.org/icmlw/2024/chen2024icmlw-transformers/}
}