Tokenized SAEs: Disentangling SAE Reconstructions
Abstract
Sparse auto-encoders (SAEs) have become a prevalent tool for interpreting language models' inner workings. However, it is unknown how strongly SAE features correspond to computationally important directions in the model. We empirically show that many RES-JB SAE features predominantly correspond to simple input statistics. We hypothesize this is caused by a large class imbalance in training data combined with a lack of complex error signals. We propose a method to reduce this behavior by disentangling token reconstruction from feature reconstruction. We achieve this by introducing a per-token bias, which provides an improved baseline for interesting reconstruction. This change yields significantly more interesting features and improved reconstruction in sparse regimes.
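The key mechanism described in the abstract is the per-token bias: rather than forcing the sparse features to reconstruct token-level statistics, the reconstruction is offset by a learned bias looked up from the input token id. Below is a minimal sketch of that idea, assuming PyTorch; the class and parameter names (TokenizedSAE, d_model, d_sae, vocab_size) are illustrative and not the authors' implementation.

```python
import torch
import torch.nn as nn

class TokenizedSAE(nn.Module):
    """Sketch of an SAE with a per-token reconstruction bias (illustrative)."""

    def __init__(self, d_model: int, d_sae: int, vocab_size: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_sae)
        self.decoder = nn.Linear(d_sae, d_model)
        # Per-token bias: a learned lookup table indexed by token id.
        # It absorbs simple input statistics so the sparse features are
        # free to capture more interesting structure.
        self.token_bias = nn.Embedding(vocab_size, d_model)

    def forward(self, x: torch.Tensor, token_ids: torch.Tensor):
        # x: model activations, shape (..., d_model)
        # token_ids: input token at each position, shape (...)
        features = torch.relu(self.encoder(x))  # sparse feature activations
        recon = self.decoder(features) + self.token_bias(token_ids)
        return recon, features
```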
Cite
Text
Dooms and Wilhelm. "Tokenized SAEs: Disentangling SAE Reconstructions." ICML 2024 Workshops: MI, 2024.
Markdown
[Dooms and Wilhelm. "Tokenized SAEs: Disentangling SAE Reconstructions." ICML 2024 Workshops: MI, 2024.](https://mlanthology.org/icmlw/2024/dooms2024icmlw-tokenized/)
BibTeX
@inproceedings{dooms2024icmlw-tokenized,
  title = {{Tokenized SAEs: Disentangling SAE Reconstructions}},
  author = {Dooms, Thomas and Wilhelm, Daniel},
  booktitle = {ICML 2024 Workshops: MI},
  year = {2024},
  url = {https://mlanthology.org/icmlw/2024/dooms2024icmlw-tokenized/}
}