How Transformers Utilize Multi-Head Attention in In-Context Learning? A Case Study on Sparse Linear Regression

Abstract

Despite the remarkable success of transformer-based models in various real-world tasks, their underlying mechanisms remain poorly understood. Recent studies have suggested that transformers can implement gradient descent as an in-context learner for linear regression problems and have developed various theoretical analyses accordingly. However, these works mostly focus on the expressive power of transformers through specific parameter constructions and lack a comprehensive understanding of the working mechanisms that emerge after training. In this study, we consider a sparse linear regression problem and investigate how a trained multi-head transformer performs in-context learning. We experimentally discover that the utilization of attention heads exhibits different patterns across layers: multiple heads are used and are essential in the first layer, while usually only a single head suffices for subsequent layers. We provide a theoretical explanation for this observation: the first layer preprocesses the context data, and the following layers execute simple optimization steps based on the preprocessed context. Moreover, we demonstrate that such a preprocess-then-optimize algorithm can significantly outperform naive gradient descent and ridge regression algorithms. Further experimental results support our explanations. Our findings offer insights into the benefits of multi-head attention and contribute to understanding the more intricate mechanisms hidden within trained transformers.
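The two baselines mentioned in the abstract (plain gradient descent and ridge regression on the in-context examples) can be illustrated with a toy sketch of the sparse linear regression setup. The sketch below is a minimal illustration under assumed hyperparameters (dimension, sparsity, context size, noise level, ridge penalty, step size); it is not the paper's experimental configuration and does not implement the paper's preprocess-then-optimize algorithm.

```python
# Toy sparse linear regression tasks with two baseline in-context predictors:
# closed-form ridge regression and a few steps of plain gradient descent.
# All hyperparameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, s, n_context, noise_std = 20, 3, 40, 0.1  # assumed dimension, sparsity, context size, noise

def sample_task():
    """Sample an s-sparse weight vector, a context of (x, y) pairs, and a query point."""
    w = np.zeros(d)
    support = rng.choice(d, size=s, replace=False)
    w[support] = rng.normal(size=s)
    X = rng.normal(size=(n_context, d))
    y = X @ w + noise_std * rng.normal(size=n_context)
    x_query = rng.normal(size=d)
    return X, y, x_query, x_query @ w  # context, labels, query, noiseless target

def ridge_predict(X, y, x_query, lam=1.0):
    """Closed-form ridge regression fit on the context, evaluated at the query."""
    w_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return x_query @ w_hat

def gd_predict(X, y, x_query, steps=5, lr=0.01):
    """A few plain gradient-descent steps on the least-squares loss over the context."""
    w_hat = np.zeros(d)
    for _ in range(steps):
        w_hat -= lr * X.T @ (X @ w_hat - y)
    return x_query @ w_hat

errs_ridge, errs_gd = [], []
for _ in range(200):
    X, y, xq, y_true = sample_task()
    errs_ridge.append((ridge_predict(X, y, xq) - y_true) ** 2)
    errs_gd.append((gd_predict(X, y, xq) - y_true) ** 2)

print("ridge MSE:       ", np.mean(errs_ridge))
print("few-step GD MSE: ", np.mean(errs_gd))
```

Neither baseline exploits the sparsity of the underlying weight vector, which is the gap the paper's preprocess-then-optimize explanation is meant to address.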

Cite

Text

Chen et al. "How Transformers Utilize Multi-Head Attention in In-Context Learning? A Case Study on Sparse Linear Regression." Neural Information Processing Systems, 2024. doi:10.52202/079017-3799

Markdown

[Chen et al. "How Transformers Utilize Multi-Head Attention in In-Context Learning? A Case Study on Sparse Linear Regression." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/chen2024neurips-transformers/) doi:10.52202/079017-3799

BibTeX

@inproceedings{chen2024neurips-transformers,
  title     = {{How Transformers Utilize Multi-Head Attention in In-Context Learning? A Case Study on Sparse Linear Regression}},
  author    = {Chen, Xingwu and Zhao, Lei and Zou, Difan},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-3799},
  url       = {https://mlanthology.org/neurips/2024/chen2024neurips-transformers/}
}