A Is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

Abstract

Sparse Autoencoders (SAEs) have emerged as a promising approach to decompose the activations of Large Language Models (LLMs) into human-interpretable components. But to what extent do SAEs extract monosemantic and interpretable latents? We systematically evaluate precision and recall of a large number of SAEs with varying width and sparsity on a first-letter identification task, where we have complete access to ground truth labels for all tokens in the vocabulary. Critically, we identify a problematic form of feature splitting we call “feature absorption” where seemingly monosemantic latents fail to fire in cases where they apparently should. Our investigation suggests that varying SAE size or sparsity is insufficient to solve this issue, and that this is a more fundamental problem related to promoting sparsity in the presence of co-occurring features.
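The abstract's evaluation hinges on scoring each SAE latent as a classifier for "token starts with a given letter." Below is a minimal sketch of that precision/recall scoring, not the authors' code; the function name, the `latent_activations` array, the `token_strings` list, and the firing threshold are all hypothetical placeholders chosen for illustration.

```python
import numpy as np

def first_letter_precision_recall(latent_activations, token_strings,
                                  letter="a", threshold=0.0):
    """Score one SAE latent as a detector for 'token begins with `letter`'.

    latent_activations: 1-D array of the latent's activation on each vocab token.
    token_strings:      the corresponding decoded token strings.
    """
    # The latent "fires" on a token if its activation exceeds the threshold.
    fires = np.asarray(latent_activations) > threshold
    # Ground-truth first-letter labels, available for every token in the vocabulary.
    truth = np.array([t.strip().lower().startswith(letter) for t in token_strings])

    tp = np.sum(fires & truth)
    precision = tp / max(fires.sum(), 1)  # of tokens where the latent fires, fraction starting with `letter`
    recall = tp / max(truth.sum(), 1)     # of tokens starting with `letter`, fraction the latent fires on
    return precision, recall
```

Under this framing, "feature absorption" shows up as a latent with high precision but imperfect recall: it misses tokens whose first-letter signal has been absorbed into another, more specific latent.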

Cite

Text

Chanin et al. "A Is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders." NeurIPS 2024 Workshops: InterpretableAI, 2024.

Markdown

[Chanin et al. "A Is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders." NeurIPS 2024 Workshops: InterpretableAI, 2024.](https://mlanthology.org/neuripsw/2024/chanin2024neuripsw-absorption/)

BibTeX

@inproceedings{chanin2024neuripsw-absorption,
  title     = {{A Is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders}},
  author    = {Chanin, David and Wilken-Smith, James and Dulka, Tomáš and Bhatnagar, Hardik and Bloom, Joseph Isaac},
  booktitle = {NeurIPS 2024 Workshops: InterpretableAI},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/chanin2024neuripsw-absorption/}
}