Device-Directed Speech Detection for Follow-up Conversations Using Large Language Models

Abstract

Follow-up conversations with virtual assistants (VAs) let a user interact with a VA seamlessly, without having to re-invoke it with a keyword after the first query. Accurate Device-directed Speech Detection (DDSD) on the follow-up queries is therefore critical for a naturalistic user experience. To this end, we explore Large Language Models (LLMs) and model the first query when making inferences about the follow-ups (based on the ASR-decoded text), either by prompting a pretrained LLM or by adapting a binary classifier on top of the LLM. In doing so, we also exploit ASR uncertainty when designing the LLM prompts. We show on a real-world dataset of follow-up conversations that this approach yields large gains (20-40% reduction in false alarms at a fixed 10% false-reject rate) over modeling follow-ups alone, owing to the joint modeling of the previous speech context and ASR uncertainty.
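The headline metric, false alarms at a fixed 10% false-reject rate, can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names, the convention that a higher score means "more likely device-directed", and the toy scores are all assumptions made for illustration.

```python
def fa_at_fixed_fr(scores, labels, target_fr=0.10):
    """Hypothetical helper: pick the highest threshold whose false-reject
    rate stays at or below target_fr, and report the false-alarm rate there.

    scores: detector outputs, higher = more likely device-directed (assumed).
    labels: 1 for device-directed utterances, 0 for non-directed speech.
    Returns (threshold, false_reject_rate, false_alarm_rate).
    """
    positives = sorted(s for s, y in zip(scores, labels) if y == 1)
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")

    # An utterance is accepted when score >= threshold, so at most k
    # directed utterances may fall below the threshold.
    k = int(target_fr * len(positives) + 1e-9)
    threshold = positives[k]

    false_rejects = sum(1 for s in positives if s < threshold)
    false_alarms = sum(1 for s in negatives if s >= threshold)
    return (threshold,
            false_rejects / len(positives),
            false_alarms / len(negatives))


def relative_reduction(baseline_fa, new_fa):
    """Relative false-alarm reduction of a new system over a baseline."""
    return (baseline_fa - new_fa) / baseline_fa


# Toy data only (not from the paper): 10 directed and 5 non-directed utterances.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.95, 0.85, 0.75, 0.65, 0.2,
          0.1, 0.3, 0.4, 0.55, 0.05]
labels = [1] * 10 + [0] * 5
threshold, fr, fa = fa_at_fixed_fr(scores, labels, target_fr=0.10)
print(f"threshold={threshold}, FR={fr:.2f}, FA={fa:.2f}")
```

Comparing two systems at the same false-reject rate, as `relative_reduction` does, keeps the comparison fair: a detector cannot lower its false alarms simply by rejecting more genuine follow-up queries.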

Cite

Text

Rudovic et al. "Device-Directed Speech Detection for Follow-up Conversations Using Large Language Models." NeurIPS 2024 Workshops: AFM, 2024.

Markdown

[Rudovic et al. "Device-Directed Speech Detection for Follow-up Conversations Using Large Language Models." NeurIPS 2024 Workshops: AFM, 2024.](https://mlanthology.org/neuripsw/2024/rudovic2024neuripsw-devicedirected/)

BibTeX

@inproceedings{rudovic2024neuripsw-devicedirected,
  title     = {{Device-Directed Speech Detection for Follow-up Conversations Using Large Language Models}},
  author    = {Rudovic, Ognjen and Dighe, Pranay and Su, Yi and Garg, Vineet and Dharur, Sameer and Niu, Xiaochuan and Abdelaziz, Ahmed Hussen and Adya, Saurabh and Tewfik, Ahmed},
  booktitle = {NeurIPS 2024 Workshops: AFM},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/rudovic2024neuripsw-devicedirected/}
}