Unsupervised Speech Recognition

Abstract

Despite rapid progress in the recent past, current speech recognition systems still require labeled training data, which limits this technology to a small fraction of the languages spoken around the globe. This paper describes wav2vec-U, short for wav2vec Unsupervised, a method to train speech recognition models without any labeled data. We leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training. The right representations are key to the success of our method. Compared to the best previous unsupervised work, wav2vec-U reduces the phone error rate on the TIMIT benchmark from 26.1 to 11.3. On the larger English Librispeech benchmark, wav2vec-U achieves a word error rate of 5.9 on test-other, rivaling some of the best published systems trained on 960 hours of labeled data from only two years ago. We also experiment on nine other languages, including low-resource languages such as Kyrgyz, Swahili and Tatar.
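
The sketch below illustrates, in a minimal and hypothetical form, the kind of adversarial setup the abstract describes: a generator maps segment-level speech representations to phoneme distributions, while a discriminator tries to distinguish them from phoneme sequences derived from unpaired text. It is not the authors' implementation; the feature dimension, phoneme inventory, network shapes, and the toy data are all illustrative assumptions.

# Minimal sketch (assumptions throughout) of adversarial phoneme mapping.
import torch
import torch.nn as nn

FEAT_DIM = 512      # assumed dimensionality of segmented speech representations
NUM_PHONEMES = 40   # assumed phoneme inventory size

# Generator: per-segment map from speech representation to phoneme distribution.
generator = nn.Sequential(
    nn.Linear(FEAT_DIM, NUM_PHONEMES),
    nn.Softmax(dim=-1),
)

# Discriminator: scores whether a phoneme-distribution sequence looks like
# real (phonemized, unpaired) text or like generator output.
class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(NUM_PHONEMES, 64, kernel_size=3, padding=1)
        self.out = nn.Linear(64, 1)

    def forward(self, x):                              # x: (batch, time, NUM_PHONEMES)
        h = torch.relu(self.conv(x.transpose(1, 2)))   # (batch, 64, time)
        return self.out(h.mean(dim=-1))                # (batch, 1) logit

discriminator = Discriminator()
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

# Toy batch: segmented speech features and one-hot phonemized text (unpaired).
speech_segments = torch.randn(8, 50, FEAT_DIM)
real_text = torch.nn.functional.one_hot(
    torch.randint(0, NUM_PHONEMES, (8, 50)), NUM_PHONEMES
).float()

# One adversarial step: update the discriminator, then the generator.
fake = generator(speech_segments)
d_loss = bce(discriminator(real_text), torch.ones(8, 1)) + \
         bce(discriminator(fake.detach()), torch.zeros(8, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

g_loss = bce(discriminator(fake), torch.ones(8, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()

In the paper, the generator output is decoded into a phoneme (and ultimately word) sequence; the sketch stops at a single generator/discriminator update to keep the adversarial objective visible.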

Cite

Text

Baevski et al. "Unsupervised Speech Recognition." Neural Information Processing Systems, 2021.

Markdown

[Baevski et al. "Unsupervised Speech Recognition." Neural Information Processing Systems, 2021.](https://mlanthology.org/neurips/2021/baevski2021neurips-unsupervised/)

BibTeX

@inproceedings{baevski2021neurips-unsupervised,
  title     = {{Unsupervised Speech Recognition}},
  author    = {Baevski, Alexei and Hsu, Wei-Ning and Conneau, Alexis and Auli, Michael},
  booktitle = {Neural Information Processing Systems},
  year      = {2021},
  url       = {https://mlanthology.org/neurips/2021/baevski2021neurips-unsupervised/}
}