ExpertLens: Activation Steering Features Are Highly Interpretable

Abstract

Activation steering methods in large language models (LLMs) have emerged as an effective way to perform targeted updates to enhance generated language without requiring large amounts of adaptation data. We ask whether the features discovered by activation steering methods are interpretable. We identify neurons responsible for specific concepts (e.g., "cat") using the "finding experts" method from research on activation steering and show that ExpertLens, i.e., the inspection of these neurons, provides insight into model representations. We find that ExpertLens representations are stable across models and datasets and closely align with human representations inferred from behavioral data, matching inter-human alignment levels. ExpertLens significantly outperforms the alignment captured by word/sentence embeddings and sparse autoencoder (SAE) features. By reconstructing human concept organization through ExpertLens, we show that it enables a granular view of LLM concept representation. Our findings suggest that ExpertLens is a flexible and lightweight approach for capturing and analyzing model representations.
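
As a rough illustration of what identifying concept neurons can look like, the sketch below scores each neuron by how well its activation ranks concept-positive examples above concept-negative ones. The abstract does not specify the scoring rule; average precision is used here as an assumption, and the function names (`average_precision`, `rank_neurons`), neuron names, and toy activations are all hypothetical rather than taken from the paper.

```python
from typing import Dict, List, Sequence, Tuple


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Average precision of `scores` treated as a ranking of binary `labels`."""
    # Rank examples from highest to lowest activation.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this positive
    return precision_sum / hits if hits else 0.0


def rank_neurons(
    activations: Dict[str, List[float]], labels: List[int]
) -> List[Tuple[str, float]]:
    """Return (neuron, score) pairs sorted from best to worst concept detector."""
    scored = [(name, average_precision(acts, labels)) for name, acts in activations.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): 1 = sentence contains the concept, 0 = it does not.
labels = [1, 1, 1, 0, 0, 0]
activations = {
    "neuron_a": [0.9, 0.8, 0.7, 0.1, 0.2, 0.3],  # fires on concept examples
    "neuron_b": [0.1, 0.9, 0.2, 0.8, 0.3, 0.7],  # mixed behavior
    "neuron_c": [0.1, 0.2, 0.3, 0.9, 0.8, 0.7],  # fires on non-concept examples
}

ranking = rank_neurons(activations, labels)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Under this assumed scoring rule, the highest-ranked neurons would be the candidates to inspect for a given concept; a real pipeline would read activations from the model rather than from a hand-written dictionary.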

Cite

Text

Fedzechkina et al. "ExpertLens: Activation Steering Features Are Highly Interpretable." Transactions on Machine Learning Research, 2026.

Markdown

[Fedzechkina et al. "ExpertLens: Activation Steering Features Are Highly Interpretable." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/fedzechkina2026tmlr-expertlens/)

BibTeX

@article{fedzechkina2026tmlr-expertlens,
  title     = {{ExpertLens: Activation Steering Features Are Highly Interpretable}},
  author    = {Fedzechkina, Masha and Gualdoni, Eleonora and Williamson, Sinead and Metcalf, Katherine and Seto, Skyler and Theobald, Barry-John},
  journal   = {Transactions on Machine Learning Research},
  year      = {2026},
  url       = {https://mlanthology.org/tmlr/2026/fedzechkina2026tmlr-expertlens/}
}