DinoSR: Self-Distillation and Online Clustering for Self-Supervised Speech Representation Learning

Abstract

In this paper, we introduce self-distillation and online clustering for self-supervised speech representation learning (DinoSR), which combines masked language modeling, self-distillation, and online clustering. We show that these concepts complement each other and result in a strong representation learning model for speech. DinoSR first extracts contextualized embeddings from the input audio with a teacher network, then runs an online clustering system on the embeddings to yield a machine-discovered phone inventory, and finally uses the discretized tokens to guide a student network. We show that DinoSR surpasses previous state-of-the-art performance in several downstream tasks, and provide a detailed analysis of the model and the learned discrete units.
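
The three-step pipeline described above (teacher embeddings, online clustering, discrete targets for the student) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the update rules, so the sketch assumes nearest-centroid assignment, an exponential-moving-average (EMA) codebook update, an EMA teacher, and a cross-entropy loss on masked frames. All function names, the toy shapes, and the decay values are hypothetical.

```python
import numpy as np

def assign_clusters(embeddings, codebook):
    """Map each frame embedding (T, D) to the index of its nearest centroid (K, D)."""
    dists = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def update_codebook(codebook, embeddings, ids, decay=0.9):
    """Assumed online clustering step: move each used centroid toward the
    mean of the embeddings assigned to it (EMA update)."""
    new_codebook = codebook.copy()
    for k in np.unique(ids):
        cluster_mean = embeddings[ids == k].mean(axis=0)
        new_codebook[k] = decay * codebook[k] + (1.0 - decay) * cluster_mean
    return new_codebook

def ema_update(teacher, student, decay=0.999):
    """Assumed self-distillation step: teacher weights track the student via EMA."""
    return {name: decay * teacher[name] + (1.0 - decay) * student[name]
            for name in teacher}

def student_loss(logits, targets, mask):
    """Cross-entropy between student predictions (T, K) and cluster ids (T,),
    averaged over masked frames only."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(targets)), targets]
    return -picked[mask].mean()

# Toy run: 6 frames, 4-dim embeddings, 3 clusters (all values are random stand-ins).
rng = np.random.default_rng(0)
teacher_emb = rng.normal(size=(6, 4))      # stand-in for teacher network output
codebook = rng.normal(size=(3, 4))         # stand-in for the cluster centroids
ids = assign_clusters(teacher_emb, codebook)
codebook = update_codebook(codebook, teacher_emb, ids)
mask = np.array([True, False, True, True, False, False])
student_logits = rng.normal(size=(6, 3))   # stand-in for student predictions
loss = student_loss(student_logits, ids, mask)
```

In this sketch the cluster indices play the role of the "discretized tokens" from the abstract: the student is trained to predict them at masked positions, while the codebook and teacher are updated without gradients.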

Cite

Text

Liu et al. "DinoSR: Self-Distillation and Online Clustering for Self-Supervised Speech Representation Learning." Neural Information Processing Systems, 2023.

Markdown

[Liu et al. "DinoSR: Self-Distillation and Online Clustering for Self-Supervised Speech Representation Learning." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/liu2023neurips-dinosr/)

BibTeX

@inproceedings{liu2023neurips-dinosr,
  title     = {{DinoSR: Self-Distillation and Online Clustering for Self-Supervised Speech Representation Learning}},
  author    = {Liu, Alexander H. and Chang, Heng-Jui and Auli, Michael and Hsu, Wei-Ning and Glass, Jim},
  booktitle = {Neural Information Processing Systems},
  year      = {2023},
  url       = {https://mlanthology.org/neurips/2023/liu2023neurips-dinosr/}
}