DePass: Unified Feature Attributing by Simple Decomposed Forward Pass

Abstract

Attributing the behavior of Transformer models to internal computations is a central challenge in mechanistic interpretability. We introduce DePass, a unified framework for feature attribution based on a single decomposed forward pass. DePass decomposes hidden states into customized additive components, then propagates them with attention scores and MLP activations fixed. It achieves faithful, fine-grained attribution without requiring auxiliary training. We validate DePass across token-level, model-component-level, and subspace-level attribution tasks, demonstrating its effectiveness and fidelity. Our experiments highlight its potential to attribute information flow between arbitrary components of a Transformer model. We hope DePass serves as a foundational tool for broader applications in interpretability.
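The key observation behind a decomposed pass of this kind is that, once the attention scores are frozen at their values from the ordinary forward pass, the attention operation becomes linear in its inputs, so additive components of the hidden states can be propagated independently and still sum to the full output. The toy NumPy sketch below illustrates only this linearity property; all tensor names and shapes are hypothetical, and the MLP, output projection, and the paper's actual decomposition scheme are omitted.

```python
import numpy as np

np.random.seed(0)
d, T, k = 8, 4, 3  # hidden size, sequence length, number of additive components

# Hypothetical decomposition: each position's hidden state is a sum of k components.
components = np.random.randn(k, T, d)   # (k, T, d)
hidden = components.sum(axis=0)         # full hidden states, (T, d)

# Attention scores frozen from the ordinary forward pass (row-stochastic).
scores = np.random.rand(T, T)
scores /= scores.sum(axis=1, keepdims=True)
W_v = np.random.randn(d, d)             # value projection (illustrative)

# With scores fixed, attention is linear, so each component propagates on its own...
full_out = scores @ (hidden @ W_v)
per_component = np.stack([scores @ (c @ W_v) for c in components])

# ...and the per-component outputs sum exactly to the undecomposed output.
assert np.allclose(per_component.sum(axis=0), full_out)
```

Handling the MLP analogously requires fixing its nonlinear activations so the block also acts linearly on the components; the abstract's "MLP activations fixed" refers to that step, which this sketch does not implement.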

Cite

Text

Hong et al. "DePass: Unified Feature Attributing by Simple Decomposed Forward Pass." Advances in Neural Information Processing Systems, 2025.

Markdown

[Hong et al. "DePass: Unified Feature Attributing by Simple Decomposed Forward Pass." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/hong2025neurips-depass/)

BibTeX

@inproceedings{hong2025neurips-depass,
  title     = {{DePass: Unified Feature Attributing by Simple Decomposed Forward Pass}},
  author    = {Hong, Xiangyu and Jiang, Che and Tian, Kai and Qi, Biqing and Sun, Youbang and Ding, Ning and Zhou, Bowen},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/hong2025neurips-depass/}
}