Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

Abstract

A major open problem in mechanistic interpretability is disentangling internal model activations into meaningful features, with recent work focusing on sparse autoencoders (SAEs) as a potential solution. However, verifying that an SAE has found the "right" features in realistic settings has been difficult, as we do not know the (hypothetical) ground-truth features to begin with. In the absence of such ground truth, current evaluation metrics are indirect and rely on proxies, toy models, or other non-trivial assumptions. To overcome this, we propose a new framework for evaluating SAEs: studying how pre-trained language models perform specific tasks, where model activations can be disentangled, with supervision, in a principled way that allows precise control and interpretability. We develop a task-specific comparison of learned SAEs to our supervised feature decompositions that is *agnostic* to whether the SAE learned exactly the same set of features as our supervised method. We instantiate this framework on the indirect object identification (IOI) task in GPT-2 Small, and report both successes and failures of SAEs in this setting.
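To make the object of study concrete, below is a minimal, illustrative sketch (not the paper's implementation) of the kind of sparse autoencoder the abstract refers to: an overcomplete dictionary trained to reconstruct model activations through a sparsely activated bottleneck. All names, dimensions, and the L1 sparsity penalty here are assumptions chosen for illustration; the paper's own training setup may differ.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Illustrative SAE: reconstructs activation vectors through an
    overcomplete feature dictionary with an L1 sparsity penalty."""

    def __init__(self, d_model: int, d_dict: int, l1_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)
        self.l1_coeff = l1_coeff

    def forward(self, acts: torch.Tensor):
        # Feature activations: non-negative and encouraged to be sparse.
        features = torch.relu(self.encoder(acts))
        recon = self.decoder(features)
        # Loss = reconstruction error + L1 penalty on feature activations.
        recon_loss = (recon - acts).pow(2).mean()
        sparsity_loss = self.l1_coeff * features.abs().sum(dim=-1).mean()
        return recon, features, recon_loss + sparsity_loss

# Hypothetical usage: 'acts' stands in for residual-stream activations
# collected from GPT-2 Small on IOI prompts (shapes are illustrative).
acts = torch.randn(64, 768)                       # batch of activation vectors
sae = SparseAutoencoder(d_model=768, d_dict=768 * 8)
recon, features, loss = sae(acts)
loss.backward()
```

The paper's evaluation question is then whether the learned `features` capture the task-relevant structure that a supervised decomposition of the same activations recovers, without requiring a one-to-one match between the two feature sets.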

Cite

Text

Makelov et al. "Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control." ICLR 2024 Workshops: SeT_LLM, 2024.

Markdown

[Makelov et al. "Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control." ICLR 2024 Workshops: SeT_LLM, 2024.](https://mlanthology.org/iclrw/2024/makelov2024iclrw-principled/)

BibTeX

@inproceedings{makelov2024iclrw-principled,
  title     = {{Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control}},
  author    = {Makelov, Aleksandar and Lange, Georg and Nanda, Neel},
  booktitle = {ICLR 2024 Workshops: SeT_LLM},
  year      = {2024},
  url       = {https://mlanthology.org/iclrw/2024/makelov2024iclrw-principled/}
}