An Empirical Analysis of Speech Self-Supervised Learning at Multiple Resolutions

Theo Clark, Benedetta Cevoli, Eloy de Jong, Timofey Abramski, Jamie Dougherty

NeurIPSW 2024

Abstract

Self-supervised learning (SSL) models have become crucial in speech processing, with recent work concentrating on architectures that capture representations across multiple timescales. The primary goal of these multi-scale architectures is to exploit the hierarchical nature of speech, where lower-resolution components aim to capture representations that align with increasingly abstract concepts (e.g., from phones to words to sentences). Although multi-scale approaches have demonstrated some improvements over single-scale models, the reasons for these improvements lack strong empirical support. In this study, we present an initial analysis of layer-wise representations in multi-scale architectures, with a focus on Canonical Correlation Analysis (CCA) and Mutual Information (MI). We apply this analysis to Multi-Resolution HuBERT (MR-HuBERT) and find that (1) the improved performance on SUPERB tasks is primarily due to the auxiliary low-resolution loss rather than the downsampling itself, and (2) downsampling to lower resolutions neither improves downstream performance nor correlates with higher-level information (e.g., words), though it does improve computational efficiency. These findings challenge assumptions about the multi-scale nature of MR-HuBERT and highlight the importance of disentangling computational efficiency from learning better representations.
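
To make the CCA part of the analysis concrete, the following is a minimal sketch of a mean canonical correlation score between two sets of frame-level features. It is not the authors' procedure: the function name `mean_cca_similarity`, the random stand-in matrices `layer_a`, `layer_b`, and `layer_c`, and the choice of a plain QR-plus-SVD CCA (rather than any projection-weighted or regularized variant the paper may use) are all assumptions made for illustration.

```python
import numpy as np


def mean_cca_similarity(x, y):
    """Mean canonical correlation between two (n_frames, dim) feature matrices.

    Hypothetical helper: assumes n_frames > dim and full column rank.
    """
    # Center each feature dimension so correlations are measured around the mean.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Orthonormal bases for the column spaces of the two feature sets.
    qx, _ = np.linalg.qr(x)
    qy, _ = np.linalg.qr(y)
    # The singular values of Qx^T Qy are the canonical correlations.
    rho = np.linalg.svd(qx.T @ qy, compute_uv=False)
    # Clip to [0, 1] to absorb floating-point overshoot, then average.
    return float(np.clip(rho, 0.0, 1.0).mean())


# Random stand-ins for layer outputs (assumption: real use would pass model features).
rng = np.random.default_rng(0)
layer_a = rng.standard_normal((500, 8))
layer_b = layer_a @ rng.standard_normal((8, 8))  # linear re-mixing of layer_a
layer_c = rng.standard_normal((500, 8))          # unrelated features

print(mean_cca_similarity(layer_a, layer_b))  # close to 1: same subspace
print(mean_cca_similarity(layer_a, layer_c))  # much lower: independent features
```

Because CCA is invariant to invertible linear transformations, the score depends only on the subspaces the two feature sets span, which is why it is a common choice for comparing layers of different models or comparing a layer against label-derived features.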

Cite

Text

Clark et al. "An Empirical Analysis of Speech Self-Supervised Learning at Multiple Resolutions." NeurIPS 2024 Workshops: SSL, 2024.

Markdown

[Clark et al. "An Empirical Analysis of Speech Self-Supervised Learning at Multiple Resolutions." NeurIPS 2024 Workshops: SSL, 2024.](https://mlanthology.org/neuripsw/2024/clark2024neuripsw-empirical/)

BibTeX

@inproceedings{clark2024neuripsw-empirical,
  title     = {{An Empirical Analysis of Speech Self-Supervised Learning at Multiple Resolutions}},
  author    = {Clark, Theo and Cevoli, Benedetta and de Jong, Eloy and Abramski, Timofey and Dougherty, Jamie},
  booktitle = {NeurIPS 2024 Workshops: SSL},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/clark2024neuripsw-empirical/}
}