Tokenized SAEs: Disentangling SAE Reconstructions

Abstract

Sparse auto-encoders (SAEs) have become a prevalent tool for interpreting language models' inner workings. However, it is unknown how strongly SAE features correspond to computationally important directions in the model. We empirically show that many RES-JB SAE features predominantly correspond to simple input statistics. We hypothesize this is caused by a large class imbalance in training data combined with a lack of complex error signals. We propose a method to reduce this behavior by disentangling token reconstruction from feature reconstruction. We achieve this by introducing a per-token bias, which provides an improved baseline for interesting reconstruction. This change yields significantly more interesting features and improved reconstruction in sparse regimes.
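
The abstract describes the core idea at a high level: a learned per-token bias absorbs token-level reconstruction, so the sparse features only need to account for what remains. Below is a minimal, illustrative sketch of what such a tokenized SAE could look like; the class name, hyperparameters, and training step are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenizedSAE(nn.Module):
    """Sparse autoencoder with a per-token bias (illustrative sketch).

    A lookup table indexed by the input token id provides a baseline
    reconstruction, so the sparse features are not spent on simple
    token statistics.
    """

    def __init__(self, d_model: int, d_hidden: int, vocab_size: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)
        # Hypothetical per-token reconstruction baseline.
        self.token_bias = nn.Embedding(vocab_size, d_model)

    def forward(self, x: torch.Tensor, token_ids: torch.Tensor):
        # x: activations (batch, d_model); token_ids: (batch,)
        features = F.relu(self.encoder(x))  # sparse feature activations
        recon = self.decoder(features) + self.token_bias(token_ids)
        return recon, features


# Example training step: reconstruction error plus an L1 sparsity penalty.
if __name__ == "__main__":
    sae = TokenizedSAE(d_model=768, d_hidden=768 * 16, vocab_size=50257)
    x = torch.randn(32, 768)
    token_ids = torch.randint(0, 50257, (32,))
    recon, features = sae(x, token_ids)
    loss = F.mse_loss(recon, x) + 1e-3 * features.abs().sum(dim=-1).mean()
    loss.backward()
```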

Cite

Text

Dooms and Wilhelm. "Tokenized SAEs: Disentangling SAE Reconstructions." ICML 2024 Workshops: MI, 2024.

Markdown

[Dooms and Wilhelm. "Tokenized SAEs: Disentangling SAE Reconstructions." ICML 2024 Workshops: MI, 2024.](https://mlanthology.org/icmlw/2024/dooms2024icmlw-tokenized/)

BibTeX

@inproceedings{dooms2024icmlw-tokenized,
  title     = {{Tokenized SAEs: Disentangling SAE Reconstructions}},
  author    = {Dooms, Thomas and Wilhelm, Daniel},
  booktitle = {ICML 2024 Workshops: MI},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/dooms2024icmlw-tokenized/}
}