Interpretable Steering of Large Language Models with Feature Guided Activation Additions

Abstract

Effective and reliable control over Large Language Model behavior remains a significant challenge. Activation steering methods, which add steering vectors to a model's hidden states, are a promising approach, but existing techniques often lack precision and interpretability in how they influence model outputs. We introduce Feature Guided Activation Additions (FGAA), a novel activation steering method that builds on insights from Contrastive Activation Addition (CAA) and Sparse Autoencoder-Targeted Steering (SAE-TS). By operating in the latent space of a Sparse Autoencoder (SAE) and employing optimization techniques to select desired SAE features, FGAA constructs precise, human-interpretable steering vectors that produce stronger steering effects while maintaining the coherence of steered model outputs. Evaluations on Gemma-2-2B and Gemma-2-9B models across various steering tasks demonstrate that FGAA outperforms the existing steering methods CAA, SAE decoder steering, and SAE-TS. Our results also highlight important trade-offs between steering scale and general model capabilities that are consistent across all tested steering methods.
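
As a rough illustration of the activation-addition idea the abstract refers to, the sketch below builds a steering vector as the mean difference between activations from contrasting prompts (the general idea behind CAA) and adds a scaled copy of it to each hidden state. It does not implement FGAA itself, since the SAE feature selection and optimization steps are not specified here. All function names are hypothetical, and plain Python lists stand in for real model activations.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def contrastive_steering_vector(pos: List[Vector], neg: List[Vector]) -> Vector:
    """Mean activation on 'positive' examples minus mean on 'negative' ones.

    Assumption: this mirrors the mean-difference construction commonly
    associated with contrastive activation addition; it is not FGAA.
    """
    mean_pos = mean_vector(pos)
    mean_neg = mean_vector(neg)
    return [p - q for p, q in zip(mean_pos, mean_neg)]


def add_steering(hidden: List[Vector], steer: Vector, scale: float) -> List[Vector]:
    """Add scale * steer to the hidden state at every token position."""
    return [[h + scale * s for h, s in zip(row, steer)] for row in hidden]


# Toy activations (hypothetical values, 2-dimensional for readability).
pos_acts = [[1.0, 2.0], [3.0, 4.0]]
neg_acts = [[0.0, 1.0], [2.0, 1.0]]
steer = contrastive_steering_vector(pos_acts, neg_acts)

hidden_states = [[0.0, 0.0], [1.0, 1.0]]
steered = add_steering(hidden_states, steer, scale=2.0)
```

The `scale` argument corresponds to the steering scale mentioned in the abstract: larger values push the hidden states further along the steering direction, which is where the reported trade-off with general model capabilities comes into play.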

Cite

Text

Soo et al. "Interpretable Steering of Large Language Models with Feature Guided Activation Additions." ICLR 2025 Workshops: BuildingTrust, 2025.

Markdown

[Soo et al. "Interpretable Steering of Large Language Models with Feature Guided Activation Additions." ICLR 2025 Workshops: BuildingTrust, 2025.](https://mlanthology.org/iclrw/2025/soo2025iclrw-interpretable/)

BibTeX

@inproceedings{soo2025iclrw-interpretable,
  title     = {{Interpretable Steering of Large Language Models with Feature Guided Activation Additions}},
  author    = {Soo, Samuel and Teng, Wesley and Balaganesh, Chandrasekaran and Guoxian, Tan and Yan, Ming},
  booktitle = {ICLR 2025 Workshops: BuildingTrust},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/soo2025iclrw-interpretable/}
}