Distributional Associations vs In-Context Reasoning: A Study of Feed-Forward and Attention Layers
Abstract
Large language models have been successful at tasks involving basic forms of in-context reasoning, such as generating coherent language, as well as storing vast amounts of knowledge. At the core of the Transformer architecture behind such models are feed-forward and attention layers, which are often associated with knowledge and reasoning, respectively. In this paper, we study this distinction empirically and theoretically in a controlled synthetic setting where certain next-token predictions involve both distributional and in-context information. We find that feed-forward layers tend to learn simple distributional associations such as bigrams, while attention layers focus on in-context reasoning. Our theoretical analysis identifies the noise in the gradients as a key factor behind this discrepancy. Finally, we illustrate how similar disparities emerge in pre-trained models through ablations on the Pythia model family on simple reasoning tasks.
Cite
Text
Chen et al. "Distributional Associations vs In-Context Reasoning: A Study of Feed-Forward and Attention Layers." International Conference on Learning Representations, 2025.
Markdown
[Chen et al. "Distributional Associations vs In-Context Reasoning: A Study of Feed-Forward and Attention Layers." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/chen2025iclr-distributional/)
BibTeX
@inproceedings{chen2025iclr-distributional,
title = {{Distributional Associations vs In-Context Reasoning: A Study of Feed-Forward and Attention Layers}},
author = {Chen, Lei and Bruna, Joan and Bietti, Alberto},
booktitle = {International Conference on Learning Representations},
year = {2025},
url = {https://mlanthology.org/iclr/2025/chen2025iclr-distributional/}
}