Localizing Auditory Concepts in CNNs
Abstract
Deep learning models are capable of complex auditory processing tasks such as keyword spotting, genre classification, and audio captioning, yet they remain opaque. While several works have explored the interpretability of neural networks for computer vision and natural language processing, the audio modality has been largely ignored. In this paper, we study the behavior of the audio CNN encoder used in the contrastively trained language-audio model, CLAP. In the domain of music and human speech sounds, we localize and identify the layers of the network that perform well on tasks of varying complexity, sometimes even outperforming the model's final outputs. Digging deeper, we also localize specific dataset classes to neuron clusters within a layer and analyze a cluster's contribution to the model's discriminability for that class. To perform these analyses, we propose an automated framework that can leverage a small dataset of a few thousand samples to evaluate and score neuron clusters for their role in classification. Our findings provide insights into the hierarchical nature of representations in audio CNNs, paving the way for improved interpretability of audio models.
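
As a rough illustration of the kind of layer-wise analysis the abstract describes (not the paper's actual framework), the sketch below attaches forward hooks to named layers of a generic PyTorch CNN audio encoder, pools each layer's activations, and fits a linear probe per layer. The encoder, the layer names, and the labeled waveforms are assumed inputs; `collect_layer_activations` and `probe_layers` are illustrative helpers, not code from the paper.

```python
# Minimal sketch of layer-wise linear probing on a CNN audio encoder.
# Assumes a generic PyTorch encoder and a small labeled dataset; this is
# an illustration of the idea, not the authors' implementation.
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def collect_layer_activations(encoder: nn.Module, waveforms: torch.Tensor,
                              layer_names: list[str]) -> dict[str, torch.Tensor]:
    """Run the encoder once and capture globally pooled activations per layer."""
    captured = {}

    def make_hook(name):
        def hook(module, inputs, output):
            # Global-average-pool conv feature maps (B, C, ...) to (B, C).
            captured[name] = output.flatten(2).mean(dim=2).detach().cpu()
        return hook

    modules = dict(encoder.named_modules())
    handles = [modules[n].register_forward_hook(make_hook(n)) for n in layer_names]
    with torch.no_grad():
        encoder(waveforms)
    for h in handles:
        h.remove()
    return captured


def probe_layers(encoder, waveforms, labels, layer_names):
    """Fit a linear probe on each layer's pooled activations; report held-out accuracy."""
    feats = collect_layer_activations(encoder, waveforms, layer_names)
    scores = {}
    for name, x in feats.items():
        x_tr, x_te, y_tr, y_te = train_test_split(
            x.numpy(), labels, test_size=0.2, random_state=0, stratify=labels)
        probe = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
        scores[name] = probe.score(x_te, y_te)
    return scores
```

A similar pooling-plus-probe setup, restricted to subsets of channels within a layer, could approximate the neuron-cluster scoring mentioned above, though the paper's exact scoring procedure is not reproduced here.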
Cite
Text
Gautam et al. "Localizing Auditory Concepts in CNNs." ICML 2024 Workshops: MI, 2024.
Markdown
[Gautam et al. "Localizing Auditory Concepts in CNNs." ICML 2024 Workshops: MI, 2024.](https://mlanthology.org/icmlw/2024/gautam2024icmlw-localizing/)
BibTeX
@inproceedings{gautam2024icmlw-localizing,
  title     = {{Localizing Auditory Concepts in CNNs}},
  author    = {Gautam, Pratyaksh and Tapaswi, Makarand and Alluri, Vinoo},
  booktitle = {ICML 2024 Workshops: MI},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/gautam2024icmlw-localizing/}
}