Can Sparse Autoencoders Be Used to Decompose and Interpret Steering Vectors?

Abstract

Steering vectors are a promising method to control the behaviour of large language models. However, their underlying mechanisms remain poorly understood. While representing steering vectors as combinations of sparse autoencoder (SAE) features appears to be a promising direction for interpreting steering vectors, recent findings show that SAE-reconstructed vectors often lack the steering properties of the original vectors. This paper investigates why directly applying SAEs to steering vectors yields misleading decompositions, identifying two reasons: (1) steering vectors fall outside the input distribution for which SAEs are designed, and (2) steering vectors can have meaningful negative projections in feature directions, which SAEs are not designed to accommodate. These limitations hinder the direct use of SAEs for interpreting steering vectors.
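The second reason in the abstract, negative projections, can be illustrated with a toy sketch. It is not the authors' code. It assumes a standard ReLU sparse autoencoder, and the weights, the activations, and the names `sae_encode` and `sae_decode` are all hypothetical. The steering vector is built as a difference of two activations, so it can point against a feature direction, and the ReLU encoder then reports that feature as inactive.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """ReLU encoder: feature activations are non-negative by construction."""
    return np.maximum(W_enc @ x + b_enc, 0.0)

def sae_decode(f, W_dec, b_dec):
    """Linear decoder: weighted sum of feature directions plus a bias."""
    return W_dec @ f + b_dec

# Toy dictionary (assumption): two orthonormal feature directions in 2-d.
W_dec = np.eye(2)        # columns are the feature directions
W_enc = W_dec.T
b_enc = np.zeros(2)
b_dec = np.zeros(2)

# Hypothetical activations for a "positive" and a "negative" prompt.
act_pos = np.array([1.0, 0.25])
act_neg = np.array([0.25, 1.0])

# A difference-of-activations steering vector: positive along feature 0,
# negative along feature 1.
steering = act_pos - act_neg

features = sae_encode(steering, W_enc, b_enc)
reconstruction = sae_decode(features, W_dec, b_dec)

print("steering:      ", steering)
print("SAE features:  ", features)
print("reconstruction:", reconstruction)
```

In this toy setup the negative component along the second feature is clipped to zero, so the reconstruction keeps only half of what the steering vector encodes. A trained SAE has learned weights and non-zero biases, so the effect there would be less clean than in this sketch.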

Cite

Text

Mayne et al. "Can Sparse Autoencoders Be Used to Decompose and Interpret Steering Vectors?" NeurIPS 2024 Workshops: MINT, 2024.

Markdown

[Mayne et al. "Can Sparse Autoencoders Be Used to Decompose and Interpret Steering Vectors?" NeurIPS 2024 Workshops: MINT, 2024.](https://mlanthology.org/neuripsw/2024/mayne2024neuripsw-sparse-a/)

BibTeX

@inproceedings{mayne2024neuripsw-sparse-a,
  title     = {{Can Sparse Autoencoders Be Used to Decompose and Interpret Steering Vectors?}},
  author    = {Mayne, Harry and Yang, Yushi and Mahdi, Adam},
  booktitle = {NeurIPS 2024 Workshops: MINT},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/mayne2024neuripsw-sparse-a/}
}