Training Speech Recognition Models to Follow Instructions

Abstract

Conventional end-to-end Automatic Speech Recognition (ASR) models focus primarily on exact transcription, lacking the flexibility for nuanced user interactions. In this paper, we train a speech recognition model to follow a diverse set of free-form text instructions across a multitude of speech recognition tasks, ranging from simple transcript manipulation to summarization. We emphasize that even without pre-trained LLMs or speech modules, a Listen-Attend-Spell model trained from scratch on Librispeech understands and executes instructions with high fidelity. These preliminary findings highlight the potential of instruction-following training to advance speech foundation models.

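As a rough illustration of the general idea only (not the paper's implementation), the sketch below shows one way a Listen-Attend-Spell-style encoder-decoder could be conditioned on a free-form text instruction: the instruction is tokenized with the same vocabulary as the transcript and prepended to the decoder input, so the same audio can yield different outputs (e.g., a verbatim transcript vs. a summary) depending on the instruction. The module names, dimensions, and prepending strategy are all assumptions made for illustration.

# Minimal sketch, assuming instruction tokens are prepended to the decoder
# input of an LAS-style model. Not the authors' code.
import torch
import torch.nn as nn

class InstructionLAS(nn.Module):
    def __init__(self, n_mels=80, vocab_size=1000, d_model=256):
        super().__init__()
        # Listener: a plain BiLSTM stands in for the pyramidal encoder.
        self.listener = nn.LSTM(n_mels, d_model // 2, num_layers=2,
                                batch_first=True, bidirectional=True)
        # Shared embedding for instruction and transcript tokens.
        self.embed = nn.Embedding(vocab_size, d_model)
        # Speller: attention over audio frames, then an LSTM decoder.
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.speller = nn.LSTM(d_model * 2, d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, features, instruction_ids, target_ids):
        # features: (B, T, n_mels) log-mel frames
        # instruction_ids / target_ids: (B, L) token ids (targets already shifted)
        enc, _ = self.listener(features)                   # (B, T, d_model)
        # Prepend the instruction to the decoder input (teacher forcing).
        dec_in = torch.cat([instruction_ids, target_ids], dim=1)
        dec_emb = self.embed(dec_in)                       # (B, L_i + L_t, d_model)
        ctx, _ = self.attn(dec_emb, enc, enc)              # attend over audio
        dec_out, _ = self.speller(torch.cat([dec_emb, ctx], dim=-1))
        return self.out(dec_out)                           # per-step logits

# Toy usage: the instruction tokens steer what the decoder is trained to emit.
model = InstructionLAS()
feats = torch.randn(2, 120, 80)            # fake batch of log-mel features
instr = torch.randint(0, 1000, (2, 8))     # e.g., tokens for "transcribe verbatim"
tgt = torch.randint(0, 1000, (2, 20))      # shifted target tokens
logits = model(feats, instr, tgt)
# Compute the loss only on the positions that follow the instruction prefix.
loss = nn.functional.cross_entropy(
    logits[:, instr.size(1):].reshape(-1, 1000), tgt.reshape(-1))
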
Cite

Text

Lai et al. "Training Speech Recognition Models to Follow Instructions." NeurIPS 2023 Workshops: Instruction, 2023.

Markdown

[Lai et al. "Training Speech Recognition Models to Follow Instructions." NeurIPS 2023 Workshops: Instruction, 2023.](https://mlanthology.org/neuripsw/2023/lai2023neuripsw-training/)

BibTeX

@inproceedings{lai2023neuripsw-training,
  title     = {{Training Speech Recognition Models to Follow Instructions}},
  author    = {Lai, Cheng-I and Lu, Zhiyun and Cao, Liangliang and Pang, Ruoming},
  booktitle = {NeurIPS 2023 Workshops: Instruction},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/lai2023neuripsw-training/}
}