Distributional Associations vs In-Context Reasoning: A Study of Feed-Forward and Attention Layers

Abstract

Large language models have been successful at tasks involving basic forms of in-context reasoning, such as generating coherent language, as well as at storing vast amounts of knowledge. At the core of the Transformer architecture behind such models are feed-forward and attention layers, which are often associated with knowledge and reasoning, respectively. In this paper, we study this distinction empirically and theoretically in a controlled synthetic setting where certain next-token predictions involve both distributional and in-context information. We find that feed-forward layers tend to learn simple distributional associations such as bigrams, while attention layers focus on in-context reasoning. Our theoretical analysis identifies gradient noise as a key factor behind this discrepancy. Finally, we illustrate how similar disparities emerge in pre-trained models through ablations on the Pythia model family on simple reasoning tasks.
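The "controlled synthetic setting" can be pictured with a toy task of the following kind. This is a minimal sketch, not the paper's exact construction: the vocabulary size, sequence length, noise token, and noise probability below are illustrative assumptions. The target after a trigger token is usually the token that followed it earlier in the sequence (in-context recall), but with some probability it is a fixed generic token (a bigram-like, purely distributional association).

```python
import random

VOCAB, SEQ_LEN, NOISE_TOKEN, P_NOISE = 64, 32, 0, 0.3  # illustrative values

def sample_sequence():
    q, a = random.sample(range(1, VOCAB), 2)   # trigger and its in-context continuation
    # Filler tokens, avoiding q so the planted pair (q, a) is unambiguous.
    seq = [random.choice([t for t in range(1, VOCAB) if t != q])
           for _ in range(SEQ_LEN)]
    pos = random.randrange(SEQ_LEN - 1)
    seq[pos:pos + 2] = [q, a]                  # plant (q, a) once in context
    seq.append(q)                              # query: the sequence ends with q
    # In-context answer a, except with probability P_NOISE a fixed noise token.
    target = NOISE_TOKEN if random.random() < P_NOISE else a
    return seq, target

seq, target = sample_sequence()
print(seq, "->", target)
```

On such data, the in-context component of the target can only be predicted by attending back to the planted pair, while the noise token can be predicted from the trigger alone, which is what lets the two kinds of information be disentangled across layers.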
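The Pythia ablations can likewise be sketched with forward hooks in HuggingFace `transformers`. This is an assumption about tooling, not the authors' exact protocol; the model size, prompt, and choice of layer here are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "EleutherAI/pythia-160m"  # illustrative choice of Pythia model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

def zero_mlp(module, inputs, output):
    # Replace the feed-forward (MLP) contribution with zeros; attention stays intact.
    return torch.zeros_like(output)

# Ablate the MLP of the last layer only (the layer choice is an assumption).
handle = model.gpt_neox.layers[-1].mlp.register_forward_hook(zero_mlp)

prompt = "The capital of France is"
with torch.no_grad():
    logits = model(**tok(prompt, return_tensors="pt")).logits
print(tok.decode(logits[0, -1].argmax().item()))  # next-token prediction without that MLP

handle.remove()  # restore the original model
```

Comparing next-token predictions with MLP versus attention outputs zeroed out is one simple way to probe which kind of information each layer type carries.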

Cite

Text

Chen et al. "Distributional Associations vs In-Context Reasoning: A Study of Feed-Forward and Attention Layers." International Conference on Learning Representations, 2025.

Markdown

[Chen et al. "Distributional Associations vs In-Context Reasoning: A Study of Feed-Forward and Attention Layers." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/chen2025iclr-distributional/)

BibTeX

@inproceedings{chen2025iclr-distributional,
  title     = {{Distributional Associations vs In-Context Reasoning: A Study of Feed-Forward and Attention Layers}},
  author    = {Chen, Lei and Bruna, Joan and Bietti, Alberto},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/chen2025iclr-distributional/}
}