Dual-Mode ASR: Unify and Improve Streaming ASR with Full-Context Modeling
Abstract
Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible, while full-context ASR waits for the completion of a full speech utterance before emitting completed hypotheses. In this work, we propose a unified framework, Dual-mode ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech recognition. We show that the latency and accuracy of streaming ASR significantly benefit from weight sharing and joint training with full-context ASR, especially with in-place knowledge distillation during training. The Dual-mode ASR framework can be applied to recent state-of-the-art convolution-based and transformer-based ASR networks. We present extensive experiments with two state-of-the-art ASR networks, ContextNet and Conformer, on two datasets: the widely used public LibriSpeech dataset and a large-scale dataset, MultiDomain. Experiments and ablation studies demonstrate that Dual-mode ASR not only simplifies the workflow of training and deploying streaming and full-context ASR models, but also significantly improves both the emission latency and recognition accuracy of streaming ASR. With Dual-mode ASR, we achieve new state-of-the-art streaming ASR results on both LibriSpeech and MultiDomain in terms of accuracy and latency.
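The abstract describes training one weight-shared model in two modes, with the full-context outputs distilled into the streaming branch. The sketch below is a minimal, illustrative PyTorch training step under assumed interfaces (e.g., `model(features, streaming=...)` returning per-frame logits, and cross-entropy as a stand-in for the RNN-T loss used in the paper); it is not the authors' implementation.

```python
# Minimal sketch of one Dual-mode training step (illustrative; names and the
# `streaming=` flag are assumptions, not the authors' API). The shared model is
# run twice per batch: once with full context and once in streaming (causal)
# mode. Both passes get an ASR loss, and the full-context outputs additionally
# teach the streaming outputs via in-place knowledge distillation.
import torch
import torch.nn.functional as F

def dual_mode_step(model, features, targets, optimizer, distill_weight=1.0):
    # Full-context pass: distillation teacher, also trained with its own ASR loss.
    full_logits = model(features, streaming=False)          # (batch, frames, vocab)
    loss_full = F.cross_entropy(                             # stand-in for RNN-T loss
        full_logits.transpose(1, 2), targets)

    # Streaming pass with the same weights, restricted to causal/limited context.
    stream_logits = model(features, streaming=True)
    loss_stream = F.cross_entropy(stream_logits.transpose(1, 2), targets)

    # In-place knowledge distillation: the streaming branch mimics the
    # full-context distribution; the teacher side is detached so gradients
    # only flow into the streaming (student) pass.
    distill = F.kl_div(
        F.log_softmax(stream_logits, dim=-1),
        F.softmax(full_logits.detach(), dim=-1),
        reduction="batchmean",
    )

    loss = loss_full + loss_stream + distill_weight * distill
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```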
Cite
Text
Yu et al. "Dual-Mode ASR: Unify and Improve Streaming ASR with Full-Context Modeling." International Conference on Learning Representations, 2021.
Markdown
[Yu et al. "Dual-Mode ASR: Unify and Improve Streaming ASR with Full-Context Modeling." International Conference on Learning Representations, 2021.](https://mlanthology.org/iclr/2021/yu2021iclr-dualmode/)
BibTeX
@inproceedings{yu2021iclr-dualmode,
title = {{Dual-Mode ASR: Unify and Improve Streaming ASR with Full-Context Modeling}},
author = {Yu, Jiahui and Han, Wei and Gulati, Anmol and Chiu, Chung-Cheng and Li, Bo and Sainath, Tara N. and Wu, Yonghui and Pang, Ruoming},
booktitle = {International Conference on Learning Representations},
year = {2021},
url = {https://mlanthology.org/iclr/2021/yu2021iclr-dualmode/}
}