Regularizing Black-Box Models for Improved Interpretability
Abstract
Most of the work on interpretable machine learning has focused on designing either inherently interpretable models, which typically trade off accuracy for interpretability, or post-hoc explanation systems, whose explanation quality can be unpredictable. Our method, ExpO, is a hybridization of these approaches that regularizes a model for explanation quality at training time. Importantly, these regularizers are differentiable, model-agnostic, and require no domain knowledge to define. We demonstrate that post-hoc explanations for ExpO-regularized models have better explanation quality, as measured by the common fidelity and stability metrics. Through a user study on a realistic task, we verify that improving these metrics leads to significantly more useful explanations.
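The abstract does not define the regularizers, so the following is only a rough sketch of the general idea of penalizing a model for poor local explanation fidelity. It assumes a one-dimensional input, a Gaussian neighborhood around the point being explained, and a least-squares line as the local explanation. The names `local_linear_fit`, `fidelity_penalty`, and `regularized_loss`, and the parameters `sigma`, `n_samples`, and `weight`, are hypothetical and not taken from the paper.

```python
import random


def local_linear_fit(f, x, sigma=0.1, n_samples=200, seed=0):
    """Fit g(x') = a * x' + b to f on points sampled near x (least squares).

    Assumption: the neighborhood is Gaussian with standard deviation sigma.
    """
    rng = random.Random(seed)
    xs = [x + rng.gauss(0.0, sigma) for _ in range(n_samples)]
    ys = [f(xi) for xi in xs]
    mean_x = sum(xs) / n_samples
    mean_y = sum(ys) / n_samples
    var_x = sum((xi - mean_x) ** 2 for xi in xs)
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(xs, ys))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b, xs, ys


def fidelity_penalty(f, x, sigma=0.1, n_samples=200, seed=0):
    """Mean squared gap between f and its local linear fit around x.

    A small value means a linear explanation describes f well near x.
    """
    a, b, xs, ys = local_linear_fit(f, x, sigma, n_samples, seed)
    return sum((a * xi + b - yi) ** 2 for xi, yi in zip(xs, ys)) / n_samples


def regularized_loss(f, x, y, weight=1.0):
    """Squared prediction error plus a weighted fidelity penalty (illustrative)."""
    return (f(x) - y) ** 2 + weight * fidelity_penalty(f, x)


if __name__ == "__main__":
    linear = lambda t: 3.0 * t + 1.0
    curved = lambda t: 50.0 * t * t
    print("linear model penalty:", fidelity_penalty(linear, 0.5))
    print("curved model penalty:", fidelity_penalty(curved, 0.5))
```

In this toy setting a model that is already linear near the point incurs almost no penalty, while a strongly curved one is penalized. A training-time version would compute the same quantity with an automatic-differentiation library so that its gradient can flow into the model parameters.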
Cite
Text
Plumb et al. "Regularizing Black-Box Models for Improved Interpretability." Neural Information Processing Systems, 2020.
Markdown
[Plumb et al. "Regularizing Black-Box Models for Improved Interpretability." Neural Information Processing Systems, 2020.](https://mlanthology.org/neurips/2020/plumb2020neurips-regularizing/)
BibTeX
@inproceedings{plumb2020neurips-regularizing,
title = {{Regularizing Black-Box Models for Improved Interpretability}},
author = {Plumb, Gregory and Al-Shedivat, Maruan and Cabrera, Ángel Alexander and Perer, Adam and Xing, Eric P. and Talwalkar, Ameet},
booktitle = {Neural Information Processing Systems},
year = {2020},
url = {https://mlanthology.org/neurips/2020/plumb2020neurips-regularizing/}
}