Automated Feature Labeling with Token-Space Gradient Descent

Abstract

We present a novel approach to feature labeling using gradient descent in token-space. While existing methods typically use language models to generate hypotheses about feature meanings, our method directly optimizes label representations by using a language model as a discriminator to predict feature activations. We formulate this as a multi-objective optimization problem in token-space, balancing prediction accuracy, entropy minimization, and linguistic naturalness. Our proof-of-concept experiments demonstrate successful convergence to interpretable single-token labels across diverse domains, including features for detecting animals, mammals, Chinese text, and numbers. While our current implementation is constrained to single-token labels and relatively simple features, the results suggest that token-space gradient descent could become a valuable addition to the interpretability researcher's toolkit.
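To make the multi-objective formulation concrete, here is a minimal sketch of how the three terms named in the abstract could be combined over a toy vocabulary. It is not the authors' implementation. The function names, the weights `w_entropy` and `w_natural`, and the use of a token prior as the naturalness term are all assumptions made for illustration. The prediction loss is passed in as a plain number, standing in for whatever loss the discriminator language model would produce.

```python
import math


def softmax(logits):
    """Turn unnormalized label logits into a distribution over tokens."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; low entropy means the label is close to one token."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def combined_loss(label_logits, prediction_loss, prior_log_probs,
                  w_entropy=0.1, w_natural=0.1):
    """Hypothetical weighted sum of the three objectives named in the abstract.

    prediction_loss: placeholder for the discriminator's activation-prediction loss.
    prior_log_probs: assumed log-probabilities of each token under a language prior,
                     used here as a stand-in for "linguistic naturalness".
    """
    probs = softmax(label_logits)
    # Expected negative log-prior of the label distribution (assumed naturalness term).
    naturalness = -sum(p * lp for p, lp in zip(probs, prior_log_probs))
    return prediction_loss + w_entropy * entropy(probs) + w_natural * naturalness


# Toy vocabulary of four tokens with a uniform prior (illustrative values only).
uniform_prior = [math.log(0.25)] * 4
spread_label = [0.0, 0.0, 0.0, 0.0]   # label mass spread over all tokens
peaked_label = [10.0, 0.0, 0.0, 0.0]  # label mass concentrated on one token

loss_spread = combined_loss(spread_label, 0.5, uniform_prior)
loss_peaked = combined_loss(peaked_label, 0.5, uniform_prior)
```

With the prediction loss held equal, the peaked label scores lower than the spread one because of the entropy term, which is the pressure that would push optimization toward a single-token label. In a real setup the logits would be updated by gradient descent through the discriminator rather than compared by hand.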

Cite

Text

Schulz and Fallows. "Automated Feature Labeling with Token-Space Gradient Descent." ICLR 2025 Workshops: BuildingTrust, 2025.

Markdown

[Schulz and Fallows. "Automated Feature Labeling with Token-Space Gradient Descent." ICLR 2025 Workshops: BuildingTrust, 2025.](https://mlanthology.org/iclrw/2025/schulz2025iclrw-automated/)

BibTeX

@inproceedings{schulz2025iclrw-automated,
  title     = {{Automated Feature Labeling with Token-Space Gradient Descent}},
  author    = {Schulz, Julian and Fallows, Seamus},
  booktitle = {ICLR 2025 Workshops: BuildingTrust},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/schulz2025iclrw-automated/}
}