Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors

Jing Huang, Junyi Tao, Thomas Icard, Diyi Yang, Christopher Potts

ICML 2025 pp. 25791-25812

Abstract

Interpretability research now offers a variety of techniques for identifying abstract internal mechanisms in neural networks. Can such techniques be used to predict how models will behave on out-of-distribution examples? In this work, we provide a positive answer to this question. Through a diverse set of language modeling tasks—including symbol manipulation, knowledge retrieval, and instruction following—we show that the most robust features for correctness prediction are those that play a distinctive causal role in the model’s behavior. Specifically, we propose two methods that leverage causal mechanisms to predict the correctness of model outputs: counterfactual simulation (checking whether key causal variables are realized) and value probing (using the values of those variables to make predictions). Both achieve high AUC-ROC in distribution and outperform methods that rely on causal-agnostic features in out-of-distribution settings, where predicting model behaviors is more crucial. Our work thus highlights a novel and significant application for internal causal analysis of language models.
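
The abstract reports results in terms of AUC-ROC for correctness prediction. As a reminder of what that metric measures, here is a minimal sketch, not the authors' evaluation code: it computes AUC-ROC as the probability that a randomly chosen correct output receives a higher predictor score than a randomly chosen incorrect one. The function name `auc_roc` and the toy scores and labels are hypothetical and chosen only for illustration; they are not data from the paper.

```python
from typing import Sequence


def auc_roc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC-ROC via pairwise comparison: the fraction of (correct, incorrect)
    pairs in which the correct example gets the higher score.
    Tied scores count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical predictor scores (higher = "output is likely correct") and
# hypothetical ground-truth labels (1 = model output correct, 0 = incorrect).
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 0, 1, 0, 0]

toy_auc = auc_roc(toy_scores, toy_labels)
print(f"AUC-ROC on toy data: {toy_auc:.3f}")
```

A value of 0.5 corresponds to a predictor that ranks correct and incorrect outputs no better than chance, and 1.0 to a perfect ranking. The pairwise form is quadratic in the number of examples, so it suits small illustrations rather than large evaluations.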

Cite

Text

Huang et al. "Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Huang et al. "Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/huang2025icml-internal/)

BibTeX

@inproceedings{huang2025icml-internal,
  title     = {{Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors}},
  author    = {Huang, Jing and Tao, Junyi and Icard, Thomas and Yang, Diyi and Potts, Christopher},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {25791--25812},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/huang2025icml-internal/}
}