On the Diversity of ASR Hypotheses in Spoken Language Understanding

Abstract

In Conversational AI, an Automatic Speech Recognition (ASR) system transcribes the user's speech, and its output is passed as input to a Spoken Language Understanding (SLU) system, which outputs semantic objects (such as intent, slot-act pairs, etc.). Recent work, including the state-of-the-art methods in SLU, utilizes either word lattices or N-Best hypotheses from the ASR system. The intuition given for using N-Best instead of 1-Best is that the additional hypotheses provide extra information that compensates for errors in the ASR transcriptions, i.e., the performance gain is attributed to the word error rate (WER) of the ASR system. We empirically show that the gain from using N-Best hypotheses is related not to WER but to the diversity of the hypotheses.
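
To make the two quantities contrasted above concrete, here is a minimal sketch in Python. The WER computation follows the standard definition (word-level edit distance divided by reference length). The diversity measure is an assumption for illustration only: the abstract does not say how diversity is measured, so `nbest_diversity` (a hypothetical name) uses the mean pairwise word-level edit distance between hypotheses as one plausible proxy, not the authors' metric.

```python
from itertools import combinations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Standard word error rate: edit distance / number of reference words."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def nbest_diversity(hypotheses):
    """Illustrative proxy (an assumption, not the paper's definition):
    mean pairwise word-level edit distance across the N-best list."""
    pairs = list(combinations(hypotheses, 2))
    if not pairs:
        return 0.0
    total = sum(edit_distance(a.split(), b.split()) for a, b in pairs)
    return total / len(pairs)
```

Under this sketch, an N-best list can have a low WER for its top hypothesis while still being diverse, or a high WER while containing near-identical hypotheses, which is why the two quantities can be examined separately.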

Cite

Text

Sahu and Dalmia. "On the Diversity of ASR Hypotheses in Spoken Language Understanding." NeurIPS 2022 Workshops: ICBINB, 2022.

Markdown

[Sahu and Dalmia. "On the Diversity of ASR Hypotheses in Spoken Language Understanding." NeurIPS 2022 Workshops: ICBINB, 2022.](https://mlanthology.org/neuripsw/2022/sahu2022neuripsw-diversity/)

BibTeX

@inproceedings{sahu2022neuripsw-diversity,
  title     = {{On the Diversity of ASR Hypotheses in Spoken Language Understanding}},
  author    = {Sahu, Surya Kant and Dalmia, Swaraj},
  booktitle = {NeurIPS 2022 Workshops: ICBINB},
  year      = {2022},
  url       = {https://mlanthology.org/neuripsw/2022/sahu2022neuripsw-diversity/}
}