Steering Clear: A Systematic Study of Activation Steering in a Toy Setup

Abstract

Activation steering is a promising family of methods for controlling LLM outputs via targeted interventions on model activations. We introduce a toy multi-label classification setup to systematically study activation steering methods, and experiment with several types of steering adapters, from steering vectors (adding a fixed vector to activations) to more expressive adapters involving projections. We evaluate the adapters across steering tasks of varying complexity, under three notions of complexity: 1) how densely the features are packed in the representation space (roughly, the number of features divided by the dimensionality of the activations), 2) the number of attributes steered, and 3) the number of values the steered attribute can take. We find that as task complexity increases, steering vector methods perform worse, while the more expressive methods only take a performance hit when there is not enough data. On the other hand, steering vectors usually outperform more expressive methods in the low-data regime, regardless of task complexity. We conclude by discussing this work's limitations, which include our toy setup not modeling features represented in superposition or continuous features, and the lack of experiments with LLMs.
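
To make the two ends of the adapter spectrum concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, and the projection-based adapter shown (remove the component along one direction, then add a fixed vector) is only one plausible form of an "adapter involving projections". The density helper follows the rough definition given in the abstract.

```python
import numpy as np


def steering_vector(h, v):
    """Steering vector: add the same fixed vector v to every activation row of h."""
    return h + v


def project_out_and_add(h, u, v):
    """Hypothetical projection-based adapter (an assumption, not the paper's exact form).

    Removes the component of each activation along direction u,
    then adds the fixed vector v.
    """
    u = u / np.linalg.norm(u)            # normalize the direction
    coeffs = h @ u                       # per-row component along u
    return h - np.outer(coeffs, u) + v   # project out, then shift


def feature_density(num_features, dim):
    """Rough packing density: number of features divided by activation dimensionality."""
    return num_features / dim


# Toy activations: 2 examples, 2 dimensions.
h = np.array([[1.0, 2.0], [3.0, 4.0]])
v = np.array([1.0, 0.0])
u = np.array([1.0, 0.0])

shifted = steering_vector(h, v)
projected = project_out_and_add(h, u, v)
```

In this toy case the steering vector moves both examples by the same offset and preserves their difference along the first axis, whereas the projection-based variant overwrites that coordinate so both examples end at the same value.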

Cite

Text

Krasheninnikov and Krueger. "Steering Clear: A Systematic Study of Activation Steering in a Toy Setup." NeurIPS 2024 Workshops: MINT, 2024.

Markdown

[Krasheninnikov and Krueger. "Steering Clear: A Systematic Study of Activation Steering in a Toy Setup." NeurIPS 2024 Workshops: MINT, 2024.](https://mlanthology.org/neuripsw/2024/krasheninnikov2024neuripsw-steering/)

BibTeX

@inproceedings{krasheninnikov2024neuripsw-steering,
  title     = {{Steering Clear: A Systematic Study of Activation Steering in a Toy Setup}},
  author    = {Krasheninnikov, Dmitrii and Krueger, David},
  booktitle = {NeurIPS 2024 Workshops: MINT},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/krasheninnikov2024neuripsw-steering/}
}