DORA: Exploring Outlier Representations in Deep Neural Networks

Abstract

Although Deep Neural Networks (DNNs) are incredibly effective in learning complex abstractions, they are susceptible to unintentionally learning spurious artifacts from the training data. To ensure model transparency, it is crucial to examine the relationships between learned representations, as unintended concepts often manifest as anomalies with respect to the desired task. In this work, we introduce DORA (Data-agnOstic Representation Analysis): the first *data-agnostic* framework for analyzing the representation space of DNNs. Our framework employs the proposed *Extreme-Activation* (EA) distance measure between representations, which utilizes the self-explaining capabilities of the network itself and does not require access to any data. We quantitatively validate the metric's correctness and its alignment with human-defined semantic distances. This coherence between the EA distance and human judgment enables us to detect representations whose underlying concepts humans would consider unnatural, by flagging them as outliers in functional distance. Finally, we demonstrate the practical usefulness of DORA by analyzing and identifying artifact representations in popular Computer Vision models.
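
The sketch below illustrates the general idea described in the abstract, not the paper's exact procedure: given hypothetical per-neuron "extreme-activation" signals (e.g., obtained via activation maximization, so no training data is needed), each neuron is characterized by its activations on all of these signals, pairwise distances are derived from Pearson correlations between these activation profiles, and neurons that are far from their nearest neighbors are flagged as candidate outlier representations. The matrix shapes, the distance formula, and the k-nearest-neighbor outlier score are illustrative assumptions.

```python
# Minimal, data-free outlier scan over neuron representations, in the spirit
# of DORA's Extreme-Activation (EA) distance. Assumes a precomputed matrix
# `acts` with acts[i, j] = activation of neuron j on the synthetic signal of
# neuron i (e.g., an activation-maximization image). All details here are a
# hedged sketch, not the paper's exact definitions.
import numpy as np


def ea_distance_matrix(acts: np.ndarray) -> np.ndarray:
    """Return a symmetric distance matrix between neurons.

    Each neuron's profile is its activation on every neuron's signal
    (a column of `acts`); distances are derived from Pearson correlation.
    """
    corr = np.corrcoef(acts.T)                      # (n_neurons, n_neurons)
    return np.sqrt(0.5 * (1.0 - np.clip(corr, -1.0, 1.0)))


def outlier_scores(dist: np.ndarray, k: int = 5) -> np.ndarray:
    """Simple k-nearest-neighbor outlier score: mean distance to the k
    closest other representations (higher = more anomalous)."""
    n = dist.shape[0]
    scores = np.empty(n)
    for i in range(n):
        d = np.delete(dist[i], i)                   # exclude self-distance
        scores[i] = np.sort(d)[:k].mean()
    return scores


# Example with random stand-in activations (replace with real network outputs):
rng = np.random.default_rng(0)
acts = rng.normal(size=(50, 50))                    # 50 neurons x 50 signals
scores = outlier_scores(ea_distance_matrix(acts))
print("most anomalous neuron index:", int(scores.argmax()))
```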

Cite

Text

Bykov et al. "DORA: Exploring Outlier Representations in Deep Neural Networks." ICLR 2023 Workshops: Trustworthy_ML, 2023.

Markdown

[Bykov et al. "DORA: Exploring Outlier Representations in Deep Neural Networks." ICLR 2023 Workshops: Trustworthy_ML, 2023.](https://mlanthology.org/iclrw/2023/bykov2023iclrw-dora/)

BibTeX

@inproceedings{bykov2023iclrw-dora,
  title     = {{DORA: Exploring Outlier Representations in Deep Neural Networks}},
  author    = {Bykov, Kirill and Deb, Mayukh and Grinwald, Dennis and Müller, Klaus-Robert and Höhne, Marina MC},
  booktitle = {ICLR 2023 Workshops: Trustworthy_ML},
  year      = {2023},
  url       = {https://mlanthology.org/iclrw/2023/bykov2023iclrw-dora/}
}