Towards End-to-End Speech Recognition with Recurrent Neural Networks

Abstract

This paper presents a speech recognition system that directly transcribes audio data to text, without requiring an intermediate phonetic representation. The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function. A modification to the objective function is introduced that trains the network to minimise the expectation of an arbitrary transcription loss function. This allows direct optimisation of the word error rate, even in the absence of a lexicon or language model. The system achieves a word error rate of 27.3% on the Wall Street Journal corpus with no prior linguistic information, 21.9% with only a lexicon of allowed words, and 8.2% with a trigram language model. Combining the network with a baseline system further reduces the error rate to 6.7%.
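The results above are all reported as word error rate. As an illustration of that metric only, here is a minimal Python sketch that computes it in the conventional way: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The function name `word_error_rate`, the whitespace tokenisation, and the example sentences are assumptions made for this sketch; the abstract does not specify the scoring tool or text normalisation used in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumes plain whitespace tokenisation (an illustrative choice).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


# Hypothetical example: one reference word is dropped by the recogniser.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

In this example one of the six reference words is deleted, so the rate is 1/6. Because insertions count as errors, the rate can exceed 1.0 when the hypothesis is much longer than the reference.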

Cite

Text

Graves and Jaitly. "Towards End-to-End Speech Recognition with Recurrent Neural Networks." International Conference on Machine Learning, 2014.

Markdown

[Graves and Jaitly. "Towards End-to-End Speech Recognition with Recurrent Neural Networks." International Conference on Machine Learning, 2014.](https://mlanthology.org/icml/2014/graves2014icml-endtoend/)

BibTeX

@inproceedings{graves2014icml-endtoend,
  title     = {{Towards End-to-End Speech Recognition with Recurrent Neural Networks}},
  author    = {Graves, Alex and Jaitly, Navdeep},
  booktitle = {International Conference on Machine Learning},
  year      = {2014},
  pages     = {1764--1772},
  volume    = {32},
  url       = {https://mlanthology.org/icml/2014/graves2014icml-endtoend/}
}