Learning Explainable Models Using Attribution Priors

Abstract

Two important topics in deep learning both involve incorporating humans into the modeling process: Model priors transfer information from humans to a model by regularizing the model's parameters; Model attributions transfer information from a model to humans by explaining the model's behavior. Previous work has taken important steps to connect these topics through various forms of gradient regularization. We find, however, that existing methods that use attributions to align a model's behavior with human intuition are ineffective. We develop an efficient and theoretically grounded feature attribution method, expected gradients, and a novel framework, attribution priors, to enforce prior expectations about a model's behavior during training. We demonstrate that attribution priors are broadly applicable by instantiating them on three different types of data: image data, gene expression data, and health care data. Our experiments show that models trained with attribution priors are more intuitive and achieve better generalization performance than both equivalent baselines and existing methods to regularize model behavior.
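The expected gradients method named in the abstract attributes a model's prediction to its input features by averaging gradients along interpolation paths between the input and baseline points sampled from the data distribution. The following is a minimal sketch of that idea, not the authors' implementation; the linear model, the `f_grad` helper, and all names here are hypothetical, chosen only to keep the example self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical differentiable model: f(x) = w . x (linear, for illustration only).
w = np.array([1.0, -2.0, 3.0])

def f_grad(x):
    # Gradient of the linear model is constant in x.
    return w

def expected_gradients(x, background, n_samples=200, rng=rng):
    """Monte Carlo estimate of
    E_{x' ~ background, a ~ U(0,1)}[(x - x') * grad f(x' + a * (x - x'))]."""
    total = np.zeros_like(x)
    for _ in range(n_samples):
        xp = background[rng.integers(len(background))]  # random baseline sample
        a = rng.uniform()                               # random interpolation point
        total += (x - xp) * f_grad(xp + a * (x - xp))
    return total / n_samples

x = np.array([1.0, 1.0, 1.0])
background = rng.normal(size=(50, 3))  # stand-in for a reference dataset
attr = expected_gradients(x, background)
```

For a linear model the attributions approximately satisfy completeness: they sum to f(x) minus the average model output over the baselines. An attribution prior, as described in the abstract, would add a penalty on such attributions (e.g., their total variation or sparsity) to the training loss.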

Cite

Text

Erion et al. "Learning Explainable Models Using Attribution Priors." International Conference on Learning Representations, 2020.

Markdown

[Erion et al. "Learning Explainable Models Using Attribution Priors." International Conference on Learning Representations, 2020.](https://mlanthology.org/iclr/2020/erion2020iclr-learning/)

BibTeX

@inproceedings{erion2020iclr-learning,
  title     = {{Learning Explainable Models Using Attribution Priors}},
  author    = {Erion, Gabriel and Janizek, Joseph D. and Sturmfels, Pascal and Lundberg, Scott and Lee, Su-In},
  booktitle = {International Conference on Learning Representations},
  year      = {2020},
  url       = {https://mlanthology.org/iclr/2020/erion2020iclr-learning/}
}