Improving End-to-End Speech Translation by Leveraging Auxiliary Speech and Text Data

Abstract

We present a method for introducing a text encoder into pre-trained end-to-end speech translation systems. It enhances the model's ability to adapt one modality (i.e., source-language speech) to another (i.e., source-language text). Thus, the speech translation model can learn from both unlabeled and labeled data, especially when source-language text data is abundant. Beyond this, we present a denoising method to build a robust text encoder that can deal with both normal and noisy text data. Our system establishes new state-of-the-art results on the MuST-C En-De, En-Fr, and LibriSpeech En-Fr tasks.

Cite

Text

Zhang et al. "Improving End-to-End Speech Translation by Leveraging Auxiliary Speech and Text Data." AAAI Conference on Artificial Intelligence, 2023. doi:10.1609/AAAI.V37I11.26637

Markdown

[Zhang et al. "Improving End-to-End Speech Translation by Leveraging Auxiliary Speech and Text Data." AAAI Conference on Artificial Intelligence, 2023.](https://mlanthology.org/aaai/2023/zhang2023aaai-improving/) doi:10.1609/AAAI.V37I11.26637

BibTeX

@inproceedings{zhang2023aaai-improving,
  title     = {{Improving End-to-End Speech Translation by Leveraging Auxiliary Speech and Text Data}},
  author    = {Zhang, Yuhao and Xu, Chen and Hu, Bojie and Zhang, Chunliang and Xiao, Tong and Zhu, Jingbo},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2023},
  pages     = {13984--13992},
  doi       = {10.1609/AAAI.V37I11.26637},
  url       = {https://mlanthology.org/aaai/2023/zhang2023aaai-improving/}
}