Do Concept Bottleneck Models Respect Localities?
Abstract
Concept-based explainability methods use human-understandable intermediaries to produce explanations for machine learning models. These methods assume concept predictions can help us understand a model's internal reasoning. In this work, we assess the degree to which such an assumption is true by analyzing whether concept predictors leverage "relevant" features to make predictions, a property we call locality. Concept-based models that fail to respect localities also fail to be explainable, because concept predictions are based on spurious features, making the interpretation of the concept predictions vacuous. To assess whether concept-based models respect localities, we construct and use three metrics that characterize when models respect localities, complementing our analysis with theoretical results. Each of our metrics captures a different notion of perturbation and assesses whether perturbing "irrelevant" features impacts the predictions made by a concept predictor. We find that many concept-based models used in practice fail to respect localities because concept predictors cannot always clearly distinguish distinct concepts. Based on these findings, we propose suggestions for alleviating this issue.
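To make the perturbation idea concrete, below is a minimal sketch of a locality check in the spirit the abstract describes: perturb only features deemed "irrelevant" to a concept and measure how much the concept prediction moves. All names here (`concept_predictor`, `relevant_mask`, `locality_leakage`) are illustrative stand-ins, not the paper's actual metrics or implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def concept_predictor(x):
    """Stand-in concept predictor: a fixed linear probe followed by a sigmoid."""
    w = np.linspace(-1.0, 1.0, x.shape[-1])  # deterministic weights for the demo
    return 1.0 / (1.0 + np.exp(-x @ w))

def locality_leakage(x, relevant_mask, n_perturbations=100, scale=1.0):
    """Mean absolute change in the concept prediction when only features
    OUTSIDE `relevant_mask` are perturbed with Gaussian noise.
    A locality-respecting predictor should yield values near zero."""
    base = concept_predictor(x)
    deltas = []
    for _ in range(n_perturbations):
        # Zero out noise on the "relevant" features; only irrelevant ones move.
        noise = rng.normal(scale=scale, size=x.shape) * (~relevant_mask)
        deltas.append(np.abs(concept_predictor(x + noise) - base))
    return float(np.mean(deltas))

# Toy input: 10 features, only the first 3 are treated as relevant to the concept.
x = rng.normal(size=10)
relevant_mask = np.zeros(10, dtype=bool)
relevant_mask[:3] = True

print(f"locality leakage: {locality_leakage(x, relevant_mask):.4f}")
```

A large leakage value would indicate that the (hypothetical) concept predictor relies on features outside the concept's designated region, i.e., it does not respect locality under this assumed setup.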
Cite
Text
Raman et al. "Do Concept Bottleneck Models Respect Localities?" Transactions on Machine Learning Research, 2025.
Markdown
[Raman et al. "Do Concept Bottleneck Models Respect Localities?" Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/raman2025tmlr-concept/)
BibTeX
@article{raman2025tmlr-concept,
title = {{Do Concept Bottleneck Models Respect Localities?}},
author = {Raman, Naveen Janaki and Zarlenga, Mateo Espinosa and Heo, Juyeon and Jamnik, Mateja},
journal = {Transactions on Machine Learning Research},
year = {2025},
url = {https://mlanthology.org/tmlr/2025/raman2025tmlr-concept/}
}