Hebbian Sparse Autoencoder
Abstract
We establish a theoretical and empirical connection between Hebbian Winner-Take-All (WTA) learning with anti-Hebbian updates and tied-weight sparse autoencoders (SAEs), offering a framework that explains the high selectivity of neurons to patterns induced by biologically inspired learning rules. By training an SAE on the token embeddings of a small language model with a gradient-free Hebbian WTA rule and competitive anti-Hebbian plasticity, we demonstrate that such methods implicitly optimize SAE objectives, although they underperform backpropagation-trained SAEs in reconstruction because the updates only approximate the true gradients. Hebbian updates approximate minimization of the reconstruction error (MSE) under tied weights, while anti-Hebbian updates enforce sparsity and feature orthogonality, analogous to the explicit L1/L2 penalties in standard SAEs. This alignment with the superposition hypothesis (Elhage et al., 2022) shows how Hebbian rules disentangle features in overcomplete latent spaces, marking the first application of Hebbian learning to SAEs for language model interpretability.
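To make the training scheme described above concrete, below is a minimal NumPy sketch of a gradient-free Hebbian WTA rule with an anti-Hebbian term applied to a tied-weight SAE. The hyperparameters, the single-winner/runner-up ranking rule (in the spirit of Krotov and Hopfield's competing hidden units), and the top-k encoder are illustrative assumptions, not the paper's exact procedure.

import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, k = 64, 256, 8        # overcomplete latent space (d_hid > d_in); illustrative sizes
lr, anti = 0.05, 0.3               # Hebbian and anti-Hebbian rates (hypothetical values)

# Tied weights: W is the encoder, W.T the decoder; rows kept at unit norm.
W = rng.normal(scale=0.1, size=(d_hid, d_in))
W /= np.linalg.norm(W, axis=1, keepdims=True)

def step(x):
    """One gradient-free Hebbian WTA update on a single input vector x of shape (d_in,)."""
    h = W @ x                            # linear encoder pre-activations
    order = np.argsort(h)[::-1]          # rank hidden units by activation
    winner, rival = order[0], order[1]
    # Hebbian: pull the winner's weight row toward x, which approximately
    # shrinks the tied-weight reconstruction error ||x - W.T h||^2.
    W[winner] += lr * (x - W[winner])
    # Anti-Hebbian: push the closest competitor away from x, decorrelating
    # features and acting like an implicit sparsity/orthogonality penalty.
    W[rival] -= anti * lr * (x - W[rival])
    W[winner] /= np.linalg.norm(W[winner])
    W[rival] /= np.linalg.norm(W[rival])

def encode(x):
    """Top-k sparse code; decoding reuses the transposed (tied) weights."""
    h = W @ x
    code = np.zeros_like(h)
    top = np.argsort(h)[-k:]
    code[top] = h[top]
    return code

# Toy loop over random vectors standing in for language-model token embeddings.
X = rng.normal(size=(2000, d_in))
for x in X:
    step(x)
recon = encode(X[0]) @ W                 # x_hat = W.T h, written as h @ W
print(float(np.mean((X[0] - recon) ** 2)))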
Cite
Text
Kurdiukov and Razzhigaev. "Hebbian Sparse Autoencoder." ICLR 2025 Workshops: NFAM, 2025.
Markdown
[Kurdiukov and Razzhigaev. "Hebbian Sparse Autoencoder." ICLR 2025 Workshops: NFAM, 2025.](https://mlanthology.org/iclrw/2025/kurdiukov2025iclrw-hebbian/)
BibTeX
@inproceedings{kurdiukov2025iclrw-hebbian,
  title = {{Hebbian Sparse Autoencoder}},
  author = {Kurdiukov, Nikita and Razzhigaev, Anton},
  booktitle = {ICLR 2025 Workshops: NFAM},
  year = {2025},
  url = {https://mlanthology.org/iclrw/2025/kurdiukov2025iclrw-hebbian/}
}