Mu$^2$SLAM: Multitask, Multilingual Speech and Language Models

Abstract

We present Mu$^2$SLAM, a multilingual sequence-to-sequence model pre-trained jointly on unlabeled speech, unlabeled text and supervised data spanning Automatic Speech Recognition (ASR), Automatic Speech Translation (AST) and Machine Translation (MT), in over 100 languages. By leveraging a quantized representation of speech as a target, Mu$^2$SLAM trains the speech-text models with a sequence-to-sequence masked denoising objective similar to T5 on the decoder and a masked language modeling (MLM) objective on the encoder, for both unlabeled speech and text, while utilizing the supervised tasks to improve cross-lingual and cross-modal representation alignment within the model. On CoVoST AST, Mu$^2$SLAM establishes a new state-of-the-art for models trained on public datasets, improving on xx-en translation over the previous best by 1.9 BLEU points and on en-xx translation by 1.1 BLEU points. On VoxPopuli ASR, our model matches the performance of an mSLAM model fine-tuned with an RNN-T decoder, despite using a relatively weaker Transformer decoder. On text understanding tasks, our model improves by more than 6% over mSLAM on XNLI, getting closer to the performance of mT5 models of comparable capacity on XNLI and TydiQA, paving the way towards a single model for all speech and text understanding tasks.
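The two pretraining objectives named in the abstract can be illustrated on a toy token sequence. The sketch below is not the paper's implementation; the masking ratio, sentinel-token naming (`<extra_id_0>`, in the style of T5), and the fixed span are illustrative assumptions. It shows (a) an MLM-style encoder corruption that replaces tokens with a mask symbol, and (b) a T5-style span-denoising example where the decoder must reproduce a dropped span after a sentinel.

```python
import random

MASK = "[MASK]"
SENTINEL = "<extra_id_{}>"  # T5-style sentinel naming; illustrative only


def mlm_mask(tokens, ratio=0.15, seed=0):
    """MLM-style encoder objective: replace a fraction of tokens with [MASK].

    Returns the corrupted sequence and a dict of {position: original token}
    that the encoder is trained to recover.
    """
    rng = random.Random(seed)
    out = list(tokens)
    n_mask = max(1, int(len(tokens) * ratio))
    idx = rng.sample(range(len(tokens)), n_mask)
    for i in idx:
        out[i] = MASK
    return out, {i: tokens[i] for i in sorted(idx)}


def span_denoise(tokens, span=(2, 4)):
    """T5-style seq2seq denoising: drop a contiguous span, mark it with a
    sentinel; the decoder target is the sentinel followed by the dropped span.
    """
    lo, hi = span
    inputs = tokens[:lo] + [SENTINEL.format(0)] + tokens[hi:]
    target = [SENTINEL.format(0)] + tokens[lo:hi]
    return inputs, target


toks = "speech and text share one encoder and decoder".split()
corrupted, labels = mlm_mask(toks)
inp, tgt = span_denoise(toks)
print(corrupted)
print(inp, "->", tgt)
```

In Mu$^2$SLAM, the same two objectives apply to both modalities because speech is quantized into discrete target tokens, so speech frames and text subwords can share this token-level masking machinery.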

Cite

Text

Cheng et al. "Mu$^2$SLAM: Multitask, Multilingual Speech and Language Models." International Conference on Machine Learning, 2023.

Markdown

[Cheng et al. "Mu$^2$SLAM: Multitask, Multilingual Speech and Language Models." International Conference on Machine Learning, 2023.](https://mlanthology.org/icml/2023/cheng2023icml-mu/)

BibTeX

@inproceedings{cheng2023icml-mu,
  title     = {{Mu$^2$SLAM: Multitask, Multilingual Speech and Language Models}},
  author    = {Cheng, Yong and Zhang, Yu and Johnson, Melvin and Macherey, Wolfgang and Bapna, Ankur},
  booktitle = {International Conference on Machine Learning},
  year      = {2023},
  pages     = {5504--5520},
  volume    = {202},
  url       = {https://mlanthology.org/icml/2023/cheng2023icml-mu/}
}