BatchTopK Sparse Autoencoders
Abstract
Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting language model activations by decomposing them into sparse, interpretable features. A popular approach is the TopK SAE, which uses a fixed number of the most active latents per sample to reconstruct the model activations. We introduce BatchTopK SAEs, a training method that improves upon TopK SAEs by relaxing the top-k constraint to the batch level, allowing a variable number of latents to be active per sample. BatchTopK SAEs consistently outperform TopK SAEs at reconstructing activations from GPT-2 Small and Gemma 2 2B. BatchTopK SAEs achieve comparable reconstruction performance to the state-of-the-art JumpReLU SAE, but have the advantage that the average number of active latents can be specified directly, rather than approximately tuned through a costly hyperparameter sweep. We provide code for training and evaluating these BatchTopK SAEs at [redacted].
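To make the batch-level relaxation concrete, the following is a minimal sketch of the selection step in PyTorch: rather than keeping the top k activations within each sample, the top k times batch-size activations are kept across the whole batch. The class name, layer shapes, ReLU pre-activation, and threshold-based masking are assumptions for illustration, not the authors' released implementation.

import torch
import torch.nn as nn


class BatchTopKSAE(nn.Module):
    """Hypothetical sketch of a BatchTopK sparse autoencoder."""

    def __init__(self, d_model: int, d_latent: int, k: int):
        super().__init__()
        self.k = k  # target average number of active latents per sample
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch_size, d_model]
        pre_acts = torch.relu(self.encoder(x))  # [batch_size, d_latent]
        batch_size = x.shape[0]
        # Batch-level top-k: keep the k * batch_size largest activations
        # across the entire batch, so individual samples may end up with
        # more or fewer than k active latents.
        n_keep = self.k * batch_size
        threshold = torch.topk(pre_acts.flatten(), n_keep).values.min()
        latents = torch.where(pre_acts >= threshold, pre_acts,
                              torch.zeros_like(pre_acts))
        return self.decoder(latents)

In this sketch the sparsity budget is shared across the batch, which is why the average number of active latents per sample equals k by construction, without any per-model threshold sweep.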
Cite
Text
Bussmann et al. "BatchTopK Sparse Autoencoders." NeurIPS 2024 Workshops: SciForDL, 2024.
Markdown
[Bussmann et al. "BatchTopK Sparse Autoencoders." NeurIPS 2024 Workshops: SciForDL, 2024.](https://mlanthology.org/neuripsw/2024/bussmann2024neuripsw-batchtopk/)
BibTeX
@inproceedings{bussmann2024neuripsw-batchtopk,
title = {{BatchTopK Sparse Autoencoders}},
author = {Bussmann, Bart and Leask, Patrick and Nanda, Neel},
booktitle = {NeurIPS 2024 Workshops: SciForDL},
year = {2024},
url = {https://mlanthology.org/neuripsw/2024/bussmann2024neuripsw-batchtopk/}
}