Tokenized SAEs: Disentangling SAE Reconstructions
Abstract
Sparse auto-encoders (SAEs) have become a prevalent tool for interpreting language models' inner workings. However, it is unknown how strongly SAE features correspond to computationally important directions in the model. We empirically show that many RES-JB SAE features predominantly correspond to simple input statistics. We hypothesize this is caused by a large class imbalance in training data combined with a lack of complex error signals. We propose a method to reduce this behavior by disentangling token reconstruction from feature reconstruction. We achieve this by introducing a per-token bias, which provides an improved baseline for interesting reconstruction. This change yields significantly more interesting features and improved reconstruction in sparse regimes.
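The key mechanism described in the abstract is the per-token bias: rather than forcing the sparse features to reconstruct token-level statistics, the reconstruction is offset by a learned bias looked up from the input token id. Below is a minimal sketch of that idea, assuming PyTorch; the class and parameter names (TokenizedSAE, d_model, d_sae, vocab_size) are illustrative and not the authors' implementation.

```python
import torch
import torch.nn as nn

class TokenizedSAE(nn.Module):
    """Sketch of an SAE with a per-token reconstruction bias (illustrative)."""

    def __init__(self, d_model: int, d_sae: int, vocab_size: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_sae)
        self.decoder = nn.Linear(d_sae, d_model)
        # Per-token bias: a learned lookup table indexed by token id.
        # It absorbs simple input statistics so the sparse features are
        # free to capture more interesting structure.
        self.token_bias = nn.Embedding(vocab_size, d_model)

    def forward(self, x: torch.Tensor, token_ids: torch.Tensor):
        # x: model activations, shape (..., d_model)
        # token_ids: input token at each position, shape (...)
        features = torch.relu(self.encoder(x))  # sparse feature activations
        recon = self.decoder(features) + self.token_bias(token_ids)
        return recon, features
```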
Cite
Text
Dooms and Wilhelm. "Tokenized SAEs: Disentangling SAE Reconstructions." ICML 2024 Workshops: MI, 2024.
Markdown
[Dooms and Wilhelm. "Tokenized SAEs: Disentangling SAE Reconstructions." ICML 2024 Workshops: MI, 2024.](https://mlanthology.org/icmlw/2024/dooms2024icmlw-tokenized/)
BibTeX
@inproceedings{dooms2024icmlw-tokenized,
  title = {{Tokenized SAEs: Disentangling SAE Reconstructions}},
  author = {Dooms, Thomas and Wilhelm, Daniel},
  booktitle = {ICML 2024 Workshops: MI},
  year = {2024},
  url = {https://mlanthology.org/icmlw/2024/dooms2024icmlw-tokenized/}
}