Building Bridges, Not Walls: Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution
Abstract
The increasing complexity of AI systems has made understanding their behavior and building trust in them a critical challenge, especially for large language models. Numerous methods have been developed to attribute model behavior to three key aspects: input features, training data, and internal model components. However, these attribution methods are studied and applied rather independently, resulting in a fragmented landscape of approaches and terminology. We argue that feature, data, and component attribution methods share fundamental similarities, and that bridging them can benefit interpretability research. We conduct a detailed analysis of successful methods across these three attribution aspects and present a unified view demonstrating that they employ similar approaches: perturbations, gradients, and linear approximations. Our unified view enhances understanding of attribution methods and highlights new directions for interpretability and broader AI areas, including model editing, steering, and regulation.
Cite
Text
Zhang et al. "Building Bridges, Not Walls: Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution." ICLR 2025 Workshops: BuildingTrust, 2025.
Markdown
[Zhang et al. "Building Bridges, Not Walls: Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution." ICLR 2025 Workshops: BuildingTrust, 2025.](https://mlanthology.org/iclrw/2025/zhang2025iclrw-building/)
BibTeX
@inproceedings{zhang2025iclrw-building,
title = {{Building Bridges, Not Walls: Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution}},
author = {Zhang, Shichang and Han, Tessa and Bhalla, Usha and Lakkaraju, Himabindu},
booktitle = {ICLR 2025 Workshops: BuildingTrust},
year = {2025},
url = {https://mlanthology.org/iclrw/2025/zhang2025iclrw-building/}
}