Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models
Abstract
Sparse Autoencoders (SAEs) are a popular method for decomposing Large Language Model (LLM) activations into interpretable latents; however, they have a substantial training cost, and SAEs learned on different models are not directly comparable. Motivated by relative representation similarity measures, we introduce Inference-Time Decomposition of Activation models (ITDAs). ITDAs are constructed by greedily sampling activations into a dictionary based on an error threshold on their matching pursuit reconstruction. ITDAs can be trained in 1% of the time of SAEs, allowing us to cheaply train them on Llama-3.1 70B and 405B. ITDA dictionaries also enable cross-model comparisons, and they outperform existing methods such as CKA, SVCCA, and a relative representation method on a benchmark of representation similarity. Code available at https://github.com/pleask/itda.
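The abstract describes dictionary construction only at a high level: an activation is added as a new dictionary atom whenever its matching pursuit reconstruction error exceeds a threshold. The Python sketch below illustrates one plausible reading of that procedure; it is not the paper's implementation, and the function names and parameters (k, tau) are illustrative assumptions.

import numpy as np

def matching_pursuit(x, dictionary, k):
    """Reconstruct x as a sparse combination of at most k dictionary atoms."""
    residual = x.copy()
    recon = np.zeros_like(x)
    for _ in range(k):
        # Pick the (unit-norm) atom most correlated with the current residual.
        scores = dictionary @ residual
        idx = int(np.argmax(np.abs(scores)))
        coef = scores[idx]
        recon += coef * dictionary[idx]
        residual -= coef * dictionary[idx]
    return recon, residual

def build_itda_dictionary(activations, k=8, tau=0.1):
    """Greedily grow a dictionary from a stream of LLM activations (sketch)."""
    atoms = [activations[0] / np.linalg.norm(activations[0])]
    for x in activations[1:]:
        D = np.stack(atoms)
        recon, _ = matching_pursuit(x, D, k)
        # Relative reconstruction error; if too high, the current dictionary
        # cannot explain x, so add (normalised) x itself as a new atom.
        err = np.linalg.norm(x - recon) / np.linalg.norm(x)
        if err > tau:
            atoms.append(x / np.linalg.norm(x))
    return np.stack(atoms)

Because construction only requires forward-pass activations and a greedy threshold test, there is no gradient-based training loop, which is consistent with the abstract's claim that ITDAs are far cheaper to build than SAEs.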
Cite
Text
Leask et al. "Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Leask et al. "Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/leask2025icml-inferencetime/)

BibTeX
@inproceedings{leask2025icml-inferencetime,
title = {{Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models}},
author = {Leask, Patrick and Nanda, Neel and Al Moubayed, Noura},
booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
year = {2025},
pages = {32803--32829},
volume = {267},
url = {https://mlanthology.org/icml/2025/leask2025icml-inferencetime/}
}