The FFN as a Key-Value Memory: Functional Specialization in Transformer Computation
Abstract
The feed-forward network (FFN) is a central but underexplored component of the transformer architecture. Traditionally viewed as a generic “capacity-adding” block, its precise computational role remains unclear. In this paper, we advance a mechanistic reinterpretation of the FFN as a content-addressable key-value memory composed of sparse, specialized circuits. Using a controlled synthetic “conditional computation” task, we demonstrate that self-attention alone does not support input-dependent logic, while the non-linear FFN is essential. Analysis of hidden activations reveals a principle of extreme sparsity, quantified via the Gini coefficient, and targeted interventions uncover non-overlapping groups of neurons forming distinct computational circuits (e.g., SUM vs. MAX). We validate this modularity with a targeted ablation experiment that provides causal evidence through a classic double dissociation. Further, we identify a hierarchical structure in which generalist neurons route information to task-specialist sub-circuits. Validation in a pre-trained DistilBERT model confirms that these principles extend to real-world language processing, where FFNs exhibit pervasive sparsity and specialization for core linguistic categories such as nouns and verbs. Together, these results go beyond the descriptive analogy of “FFNs as memory” and provide systematic, structural evidence that the FFN is a dynamic computational engine underpinning the success of transformer models.
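The key-value view described above can be made concrete with a short sketch. This is an illustrative NumPy toy, not the paper's implementation: the rows of the first weight matrix act as keys matched against the input, the post-ReLU activations are sparse match scores, and the rows of the second matrix are the values those scores retrieve. The `gini` helper shows one standard way to quantify activation sparsity with the Gini coefficient, as the abstract mentions.

```python
import numpy as np

def ffn(x, W_in, b_in, W_out, b_out):
    """Two-layer FFN read as a key-value memory: rows of W_in are keys,
    ReLU activations are (sparse) match scores, rows of W_out are values."""
    scores = np.maximum(0.0, x @ W_in.T + b_in)   # key matching + ReLU
    return scores @ W_out + b_out, scores          # weighted sum of values

def gini(a):
    """Gini coefficient of activation magnitudes: 0 for a uniform
    pattern, approaching 1 when mass concentrates on a few neurons."""
    a = np.sort(np.abs(np.ravel(a)))               # ascending magnitudes
    n = a.size
    cum = np.cumsum(a)
    return (n + 1 - 2 * (cum / cum[-1]).sum()) / n

# Toy usage with random weights (dimensions are arbitrary choices).
rng = np.random.default_rng(0)
x = rng.normal(size=16)
W_in, b_in = rng.normal(size=(64, 16)), np.zeros(64)
W_out, b_out = rng.normal(size=(64, 16)), np.zeros(16)
y, acts = ffn(x, W_in, b_in, W_out, b_out)
print(f"Gini of hidden activations: {gini(acts):.2f}")
```

With random weights the ReLU alone already zeroes roughly half of the hidden units; the paper's claim is that trained FFNs are far sparser still, which this metric would register as a Gini value near 1.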
Cite
Rahman et al. "The FFN as a Key-Value Memory: Functional Specialization in Transformer Computation." Machine Learning, 2026. doi:10.1007/S10994-025-06948-1
BibTeX:
@article{rahman2026mlj-ffn,
title = {{The FFN as a Key-Value Memory: Functional Specialization in Transformer Computation}},
author = {Rahman, Zaryab and Din, Fakhrud and Khalid, Shah and Karthikeyan, Rishi},
journal = {Machine Learning},
year = {2026},
pages = {2},
doi = {10.1007/S10994-025-06948-1},
volume = {115},
url = {https://mlanthology.org/mlj/2026/rahman2026mlj-ffn/}
}