SCIURus: Shared Circuits for Interpretable Uncertainty Representations in Language Models

Abstract

We investigate the mechanistic sources of uncertainty in large language models (LLMs), an area with important implications for their reliability and trustworthiness. To do so, we conduct a series of experiments designed to identify whether the factuality of generated responses and a model's uncertainty originate in separate or shared circuits in the model architecture. We approach this question by adapting the well-established mechanistic interpretability techniques of path patching and zero-ablation, which allow us to identify the effect of individual circuits on LLM generations. Our extensive experiments on eight different models and five datasets, representing tasks that predominantly require factual recall, clearly demonstrate that uncertainty is produced in the same parts of the model that are responsible for the factuality of generated responses. We release code for our implementation.
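To give a rough sense of the zero-ablation technique mentioned in the abstract, the sketch below zeroes out the output of a single attention head in GPT-2 via a PyTorch forward hook and inspects how the next-token prediction changes. The model, layer and head indices, and prompt are illustrative assumptions, not the circuits studied in the paper; see the authors' released code for their actual implementation.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

LAYER, HEAD = 5, 3  # hypothetical head to ablate; not taken from the paper
HEAD_DIM = model.config.n_embd // model.config.n_head

def zero_head(module, inputs, output):
    # GPT2Attention returns a tuple whose first element is the attention
    # output of shape (batch, seq, hidden); zero out one head's slice of it.
    attn_out = output[0]
    attn_out[..., HEAD * HEAD_DIM:(HEAD + 1) * HEAD_DIM] = 0.0
    return (attn_out,) + output[1:]

handle = model.transformer.h[LAYER].attn.register_forward_hook(zero_head)

ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits[0, -1]  # next-token logits under ablation
handle.remove()

print(tokenizer.decode([logits.argmax().item()]))

Comparing these logits against a clean (un-hooked) forward pass gives the ablation effect for that head; repeating this over all heads is one way to locate the components whose removal most changes factuality or uncertainty.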

Cite

Text

Teplica et al. "SCIURus: Shared Circuits for Interpretable Uncertainty Representations in Language Models." NeurIPS 2024 Workshops: MINT, 2024.

Markdown

[Teplica et al. "SCIURus: Shared Circuits for Interpretable Uncertainty Representations in Language Models." NeurIPS 2024 Workshops: MINT, 2024.](https://mlanthology.org/neuripsw/2024/teplica2024neuripsw-sciurus/)

BibTeX

@inproceedings{teplica2024neuripsw-sciurus,
  title     = {{SCIURus: Shared Circuits for Interpretable Uncertainty Representations in Language Models}},
  author    = {Teplica, Carter and Liu, Yixin and Cohan, Arman and Rudner, Tim G. J.},
  booktitle = {NeurIPS 2024 Workshops: MINT},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/teplica2024neuripsw-sciurus/}
}