Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

Abstract

Disentangling model activations into human-interpretable features is a central problem in interpretability. Sparse autoencoders (SAEs) have recently attracted much attention as a scalable unsupervised approach to this problem. However, our imprecise understanding of ground-truth features in realistic scenarios makes it difficult to measure the success of SAEs. To address this challenge, we propose to evaluate SAEs on specific tasks by comparing them to supervised feature dictionaries computed with knowledge of the concepts relevant to the task. Specifically, we suggest that it is possible to (1) compute supervised sparse feature dictionaries that disentangle model computations for a specific task; (2) use them to evaluate and contextualize the degree of disentanglement and control offered by SAE latents on this task. Importantly, we can do this in a way that is agnostic to whether the SAEs have learned the exact ground-truth features or a different but similarly useful representation. As a case study, we apply this framework to the indirect object identification (IOI) task using GPT-2 Small, with SAEs trained on either the IOI or OpenWebText datasets. We find that SAEs capture interpretable features for the IOI task, and that more recent SAE variants such as Gated SAEs and Top-K SAEs are competitive with supervised features in terms of disentanglement and control over the model. We also exhibit, through this setup and toy models, some qualitative phenomena in SAE training illustrating feature splitting and the role of feature magnitudes in solutions preferred by SAEs.
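To make the object of study concrete, here is a minimal sketch of the kind of sparse autoencoder the abstract refers to: an encoder maps model activations to non-negative sparse latents via a ReLU, and a decoder reconstructs the activations from them, trained with a reconstruction term plus an L1 sparsity penalty. All sizes, initializations, and the `l1_coeff` value below are illustrative assumptions, not the paper's configuration (GPT-2 Small's residual stream has dimension 768, for instance).

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_sae = 8, 32  # hypothetical sizes; GPT-2 Small uses d_model = 768

# Untied encoder/decoder weights, a common SAE setup
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm decoder rows
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode activations into sparse latents, then reconstruct."""
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)  # ReLU latents
    x_hat = f @ W_dec + b_dec
    return f, x_hat

def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty encouraging sparse latents."""
    f, x_hat = sae_forward(x)
    recon = np.mean((x - x_hat) ** 2)
    sparsity = l1_coeff * np.abs(f).sum(axis=-1).mean()
    return recon + sparsity

x = rng.normal(size=(4, d_model))      # a batch of 4 fake activation vectors
f, x_hat = sae_forward(x)
print(f.shape, x_hat.shape)            # (4, 32) (4, 8)
```

The latents `f` play the role of the "features" discussed above; variants such as Gated and Top-K SAEs change how these latents are computed and sparsified, while the supervised dictionaries the paper proposes are built with task knowledge rather than learned unsupervised.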

Cite

Text

Makelov et al. "Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control." International Conference on Learning Representations, 2025.

Markdown

[Makelov et al. "Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/makelov2025iclr-principled/)

BibTeX

@inproceedings{makelov2025iclr-principled,
  title     = {{Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control}},
  author    = {Makelov, Aleksandar and Lange, Georg and Nanda, Neel},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/makelov2025iclr-principled/}
}