Fully Neural Network Based Speech Recognition on Mobile and Embedded Devices

Abstract

Real-time automatic speech recognition (ASR) on mobile and embedded devices has been of great interest for many years. We present real-time speech recognition on smartphones or embedded systems by employing recurrent neural network (RNN)-based acoustic models, RNN-based language models, and beam-search decoding. The acoustic model is trained end-to-end with connectionist temporal classification (CTC) loss. RNN implementations on embedded devices can suffer from excessive DRAM accesses because the parameter size of a neural network usually exceeds the capacity of the cache memory and the parameters are used only once per time step. To remedy this problem, we employ a multi-time step parallelization approach that computes multiple output samples at a time with the parameters fetched from the DRAM. Since the number of DRAM accesses is reduced in proportion to the number of parallelization steps, we can achieve a high processing speed. However, conventional RNNs, such as long short-term memory (LSTM) or gated recurrent unit (GRU) networks, do not permit multi-time step parallelization. We therefore construct an acoustic model that combines simple recurrent units (SRUs) and depth-wise 1-dimensional convolution layers, which do permit multi-time step parallelization. Both character and word-piece models are developed for acoustic modeling, and the corresponding RNN-based language models are used for beam-search decoding. We achieve a competitive WER on the WSJ corpus with a total model size of around 15 MB, and we achieve real-time speed using only a single ARM core without a GPU or special hardware.
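
The multi-time step idea in the abstract can be illustrated with a small pure-Python sketch. It is not the authors' implementation: it assumes a simplified SRU-style cell with only a forget gate (the output/highway gate is omitted), and the names `matvec_block`, `sru_forward`, and the `row_fetches` counter are hypothetical. The counter stands in for DRAM traffic by counting how many weight rows are read. Because the input-dependent terms do not depend on the previous state, each weight row can be read once and reused for every time step in a block, while the recurrence itself stays element-wise.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec_block(W, xs, counter):
    """Multiply W by every input vector in the block xs.

    Each weight row is "fetched" once and reused for all len(xs) time steps.
    counter["row_fetches"] is a stand-in for DRAM reads (an assumption of
    this sketch, not a measurement).
    """
    out = [[0.0] * len(W) for _ in xs]
    for i, row in enumerate(W):        # one simulated fetch per weight row
        counter["row_fetches"] += 1
        for t, x in enumerate(xs):     # reuse the row across time steps
            out[t][i] = sum(w * v for w, v in zip(row, x))
    return out


def sru_forward(xs, W, Wf, bf, block, counter):
    """Run a simplified SRU-style layer over xs in blocks of `block` steps.

    Returns the list of cell states c_t, one per time step.
    """
    n = len(W)
    c = [0.0] * n
    states = []
    for start in range(0, len(xs), block):
        chunk = xs[start:start + block]
        # Input-dependent terms have no dependence on c, so they are
        # computed for the whole block with one pass over the weights.
        xt = matvec_block(W, chunk, counter)
        ft = matvec_block(Wf, chunk, counter)
        # The recurrence is element-wise and needs no weight matrix.
        for t in range(len(chunk)):
            f = [sigmoid(ft[t][i] + bf[i]) for i in range(n)]
            c = [f[i] * c[i] + (1.0 - f[i]) * xt[t][i] for i in range(n)]
            states.append(c)
    return states


# Toy weights and an 8-step input sequence (illustrative values only).
W = [[0.5, -0.2], [0.1, 0.3]]
Wf = [[0.2, 0.4], [-0.3, 0.1]]
bf = [0.0, 0.1]
xs = [[1.0, 0.5], [0.2, -0.1], [0.0, 1.0], [-0.5, 0.3],
      [0.7, 0.7], [0.1, 0.0], [-0.2, 0.4], [0.9, -0.6]]

serial = {"row_fetches": 0}
blocked = {"row_fetches": 0}
out_serial = sru_forward(xs, W, Wf, bf, block=1, counter=serial)
out_blocked = sru_forward(xs, W, Wf, bf, block=4, counter=blocked)

print(serial["row_fetches"], blocked["row_fetches"])  # 32 8
```

With a block of 4 time steps the simulated weight fetches drop by a factor of 4 while the outputs are unchanged. An LSTM or GRU could not be blocked this way under the same scheme, because its gate computations multiply a weight matrix by the previous hidden state, which is not available until the preceding step finishes.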

Cite

Text

Park et al. "Fully Neural Network Based Speech Recognition on Mobile and Embedded Devices." Neural Information Processing Systems, 2018.

Markdown

[Park et al. "Fully Neural Network Based Speech Recognition on Mobile and Embedded Devices." Neural Information Processing Systems, 2018.](https://mlanthology.org/neurips/2018/park2018neurips-fully/)

BibTeX

@inproceedings{park2018neurips-fully,
  title     = {{Fully Neural Network Based Speech Recognition on Mobile and Embedded Devices}},
  author    = {Park, Jinhwan and Boo, Yoonho and Choi, Iksoo and Shin, Sungho and Sung, Wonyong},
  booktitle = {Neural Information Processing Systems},
  year      = {2018},
  pages     = {10620-10630},
  url       = {https://mlanthology.org/neurips/2018/park2018neurips-fully/}
}