The Logical Implication Steering Method for Conditional Interventions on Transformer Generation

Abstract

The field of mechanistic interpretability in pre-trained transformer models has demonstrated substantial evidence supporting the “linear representation hypothesis”: the idea that high-level concepts are encoded as vectors in a model’s activation space. Studies also show that a model’s generation behavior can be steered toward a given concept by adding the concept’s vector to the corresponding activations. We show how to leverage these properties to build a form of logical implication into models, enabling transparent and interpretable adjustments that induce a chosen generation behavior in response to the presence of any given concept. Our method, Logical Implication Model Steering (LIMS), unlocks new hand-engineered reasoning capabilities by integrating neuro-symbolic logic into pre-trained transformer models.
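
To make the "concept implies behavior" mechanism concrete, here is a minimal, hypothetical sketch in PyTorch of conditional activation steering in the spirit the abstract describes. It is not the paper's implementation; the concept vector c, steering vector s, threshold tau, hook placement, and all names are illustrative assumptions.

import torch

def conditional_steering_hook(c, s, tau):
    # Hypothetical "if concept detected, steer behavior" hook: where a
    # token's hidden state projects onto the unit-normalized concept
    # direction c above threshold tau, add the steering vector s.
    c = c / c.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output  # (batch, seq, d_model)
        score = hidden @ c                    # per-token concept detection score
        mask = (score > tau).unsqueeze(-1)    # True at positions where the concept is present
        steered = hidden + mask * s           # apply the steering vector only there
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# Usage (the module path is an assumption about the model architecture):
# handle = model.transformer.h[10].register_forward_hook(conditional_steering_hook(c, s, tau))

The hook form keeps the intervention transparent: the detection (a dot product against c) and the response (adding s) are each a single, inspectable linear operation on the residual stream.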

Cite

Text

Kalajdzievski. "The Logical Implication Steering Method for Conditional Interventions on Transformer Generation." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Kalajdzievski. "The Logical Implication Steering Method for Conditional Interventions on Transformer Generation." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/kalajdzievski2025icml-logical/)

BibTeX

@inproceedings{kalajdzievski2025icml-logical,
  title     = {{The Logical Implication Steering Method for Conditional Interventions on Transformer Generation}},
  author    = {Kalajdzievski, Damjan},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {28689--28720},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/kalajdzievski2025icml-logical/}
}