Investigating Self-Supervised Pre-Training for End-to-End Speech Translation

Abstract

Self-supervised learning from raw speech has proven beneficial for improving automatic speech recognition (ASR). Here, we investigate its impact on end-to-end automatic speech translation (AST) performance. We use a contrastive predictive coding (CPC) model pre-trained on unlabeled speech as a feature extractor for a downstream AST task. We show that self-supervised pre-training is particularly effective in low-resource settings and that fine-tuning CPC models on the AST training data further improves performance. Even in higher-resource settings, ensembling AST models trained on filter-bank and CPC representations leads to near state-of-the-art models without any ASR pre-training. This can be particularly beneficial when one needs to develop a system that translates from speech in a language with a poorly standardized orthography, or even from speech in an unwritten language.
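To make the setup concrete, the sketch below shows one way a CPC-style encoder pre-trained on raw waveforms could replace filter-bank features at the input of an end-to-end AST model, with the option of freezing it (pure feature extraction) or fine-tuning it on the AST data. This is a minimal illustration, not the authors' exact architecture: the layer sizes, the Transformer configuration, and the `CPCFeatureExtractor` class (initialized with random weights here) are assumptions made for the example.

import torch
import torch.nn as nn


class CPCFeatureExtractor(nn.Module):
    """Stand-in for a CPC encoder pre-trained on unlabeled raw speech.

    Strided 1-D convolutions downsample the 16 kHz waveform into frame-level
    representations. In the paper's setting these weights would come from
    contrastive predictive coding pre-training (here they are random) and can
    optionally be fine-tuned on the AST training data.
    """

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, feat_dim, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=4, stride=2), nn.ReLU(),
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) -> features: (batch, frames, feat_dim)
        return self.encoder(wav.unsqueeze(1)).transpose(1, 2)


class SpeechTranslationModel(nn.Module):
    """Minimal end-to-end AST model: CPC features -> Transformer -> target text."""

    def __init__(self, vocab_size: int, feat_dim: int = 256,
                 freeze_frontend: bool = True):
        super().__init__()
        self.frontend = CPCFeatureExtractor(feat_dim)
        if freeze_frontend:  # use CPC purely as a frozen feature extractor
            for p in self.frontend.parameters():
                p.requires_grad = False
        self.transformer = nn.Transformer(
            d_model=feat_dim, nhead=4,
            num_encoder_layers=3, num_decoder_layers=3,
            batch_first=True,
        )
        self.tgt_embed = nn.Embedding(vocab_size, feat_dim)
        self.out_proj = nn.Linear(feat_dim, vocab_size)

    def forward(self, wav: torch.Tensor, tgt_tokens: torch.Tensor) -> torch.Tensor:
        src = self.frontend(wav)          # (batch, frames, feat_dim)
        tgt = self.tgt_embed(tgt_tokens)  # (batch, tgt_len, feat_dim)
        dec = self.transformer(src, tgt)  # teacher-forced decoding
        return self.out_proj(dec)         # (batch, tgt_len, vocab_size)


if __name__ == "__main__":
    model = SpeechTranslationModel(vocab_size=1000)
    wav = torch.randn(2, 16000)            # two 1-second dummy utterances
    tgt = torch.randint(0, 1000, (2, 12))  # dummy target token ids
    print(model(wav, tgt).shape)           # torch.Size([2, 12, 1000])

Swapping `CPCFeatureExtractor` for a standard filter-bank front end yields the baseline the abstract compares against; the ensembling result refers to combining models trained with each of the two input representations.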

Cite

Text

Nguyen et al. "Investigating Self-Supervised Pre-Training for End-to-End Speech Translation." ICML 2020 Workshops: SAS, 2020.

Markdown

[Nguyen et al. "Investigating Self-Supervised Pre-Training for End-to-End Speech Translation." ICML 2020 Workshops: SAS, 2020.](https://mlanthology.org/icmlw/2020/nguyen2020icmlw-investigating/)

BibTeX

@inproceedings{nguyen2020icmlw-investigating,
  title     = {{Investigating Self-Supervised Pre-Training for End-to-End Speech Translation}},
  author    = {Nguyen, Ha and Bougares, Fethi and Tomashenko, Natalia and Estève, Yannick and Besacier, Laurent},
  booktitle = {ICML 2020 Workshops: SAS},
  year      = {2020},
  url       = {https://mlanthology.org/icmlw/2020/nguyen2020icmlw-investigating/}
}