Training Neural Networks for Modularity Aids Interpretability

Abstract

One approach to improving network interpretability is clusterability, i.e., splitting a model into disjoint clusters that can be studied independently. We find that pretrained models are highly unclusterable, and we therefore train models to be more modular using an "enmeshment loss" function that encourages the formation of non-interacting clusters. Using automated interpretability measures, we show that our method finds clusters that learn different, disjoint, and smaller circuits for CIFAR-10 labels. Our approach provides a promising direction for making neural networks easier to interpret and thereby control.
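The abstract does not specify the form of the enmeshment loss. As a rough illustration of the general idea only, the sketch below assumes fixed neuron-to-cluster assignments and penalizes the magnitude of cross-cluster weights in a layer, so that driving the penalty toward zero pushes the layer toward non-interacting clusters; the function name and setup are hypothetical and may differ from the paper's actual formulation.

```python
import torch

def cross_cluster_penalty(weight: torch.Tensor,
                          in_cluster_ids: torch.Tensor,
                          out_cluster_ids: torch.Tensor) -> torch.Tensor:
    """Penalize cross-cluster connections in a linear layer's weight matrix.

    weight:          (out_features, in_features) weight matrix
    in_cluster_ids:  (in_features,) cluster label for each input neuron
    out_cluster_ids: (out_features,) cluster label for each output neuron
    """
    # 1 where the input and output neurons belong to different clusters, 0 otherwise.
    cross_mask = (out_cluster_ids[:, None] != in_cluster_ids[None, :]).float()
    # Sum of squared cross-cluster weights; minimizing this suppresses interaction
    # between clusters while leaving within-cluster weights unconstrained.
    return (weight.pow(2) * cross_mask).sum()


# Usage sketch: add the penalty (scaled by a coefficient) to the task loss per layer.
layer = torch.nn.Linear(8, 8)
in_ids = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
out_ids = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
penalty = cross_cluster_penalty(layer.weight, in_ids, out_ids)
```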

Cite

Text

Golechha et al. "Training Neural Networks for Modularity Aids Interpretability." NeurIPS 2024 Workshops: SciForDL, 2024.

Markdown

[Golechha et al. "Training Neural Networks for Modularity Aids Interpretability." NeurIPS 2024 Workshops: SciForDL, 2024.](https://mlanthology.org/neuripsw/2024/golechha2024neuripsw-training/)

BibTeX

@inproceedings{golechha2024neuripsw-training,
  title     = {{Training Neural Networks for Modularity Aids Interpretability}},
  author    = {Golechha, Satvik and Cope, Dylan and Schoots, Nandi},
  booktitle = {NeurIPS 2024 Workshops: SciForDL},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/golechha2024neuripsw-training/}
}