Global Normalization for Streaming Speech Recognition in a Modular Framework

Abstract

We introduce the Globally Normalized Autoregressive Transducer (GNAT) to address the label bias problem in streaming speech recognition. Our solution admits tractable, exact computation of the denominator required for sequence-level normalization. Through theoretical and empirical results, we demonstrate that switching to a globally normalized model greatly reduces the word error rate gap between streaming and non-streaming speech recognition models (by more than 50% on the LibriSpeech dataset). The model is developed in a modular framework that encompasses all the common neural speech recognition models. The modularity of this framework enables controlled comparison of modelling choices and the creation of new models. A JAX implementation of our models has been open-sourced.
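
To make the idea of an exactly computable sequence-level denominator concrete, the following is a minimal toy sketch in plain Python. It is not the GNAT model or its released JAX code. It assumes, purely for illustration, that each step's score depends only on the previous label (a finite context), so the sum over all label sequences can be accumulated with a forward recursion instead of by enumeration. The score table `SCORES` and every function name here are hypothetical.

```python
import itertools
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def path_score(scores, labels):
    """Unnormalized log-score of one label sequence.

    scores[t][prev][cur] is a hypothetical score for emitting `cur` at step t
    after `prev`; label 0 doubles as the start context at t == 0.
    """
    total, prev = 0.0, 0
    for t, cur in enumerate(labels):
        total += scores[t][prev][cur]
        prev = cur
    return total


def log_denominator(scores):
    """Exact log Z over all label sequences, in O(steps * vocab^2) time."""
    vocab = len(scores[0])
    # alpha[c]: log-sum of the scores of all prefixes that end in context c.
    alpha = [0.0 if c == 0 else -math.inf for c in range(vocab)]
    for step in scores:
        alpha = [
            logsumexp([alpha[p] + step[p][c] for p in range(vocab)])
            for c in range(vocab)
        ]
    return logsumexp(alpha)


def log_denominator_brute_force(scores):
    """Same quantity by enumerating every sequence (exponential cost)."""
    vocab, steps = len(scores[0]), len(scores)
    return logsumexp([
        path_score(scores, labels)
        for labels in itertools.product(range(vocab), repeat=steps)
    ])


def global_log_prob(scores, labels):
    """Globally normalized log-probability: path score minus log Z."""
    return path_score(scores, labels) - log_denominator(scores)


# Arbitrary toy scores: 4 steps, 3 labels (assumed values, not real data).
STEPS, VOCAB = 4, 3
SCORES = [
    [[math.sin(1 + t + 2 * p + 3 * c) for c in range(VOCAB)]
     for p in range(VOCAB)]
    for t in range(STEPS)
]
```

Under these assumptions, `log_denominator(SCORES)` and `log_denominator_brute_force(SCORES)` agree, and exponentiating `global_log_prob` over every sequence sums to one. The scores are normalized once per sequence rather than once per step, which is the distinction between global and local normalization that the abstract draws.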

Cite

Text

Variani et al. "Global Normalization for Streaming Speech Recognition in a Modular Framework." Neural Information Processing Systems, 2022.

Markdown

[Variani et al. "Global Normalization for Streaming Speech Recognition in a Modular Framework." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/variani2022neurips-global/)

BibTeX

@inproceedings{variani2022neurips-global,
  title     = {{Global Normalization for Streaming Speech Recognition in a Modular Framework}},
  author    = {Variani, Ehsan and Wu, Ke and Riley, Michael D and Rybach, David and Shannon, Matt and Allauzen, Cyril},
  booktitle = {Neural Information Processing Systems},
  year      = {2022},
  url       = {https://mlanthology.org/neurips/2022/variani2022neurips-global/}
}