Parrot: Autoregressive Spoken Dialogue Language Modeling with Decoder-Only Transformers

Abstract

Recent advancements in large language models (LLMs) have demonstrated significant potential in enhancing real-time spoken interactions. Presently, open-source methodologies predominantly depend on intermediate generative text-based transcriptions to manage real-time spoken dialogues. However, these techniques often struggle to provide seamless interactions that involve real-time streaming audio inputs. In this research, we present a spoken dialogue language model, Parrot, distinguished by its pre-training and supervised fine-tuning (SFT) pipeline. This pipeline deviates from conventional methodologies by utilizing both single-channel audio data and dual-channel spoken dialogue data to train the textless speech language model. During pre-training, we transform single-channel audio input into a sequence of discrete tokens and train the LLM to model these audio tokens via next-token prediction. In the SFT phase, we introduce an approach to dual-channel generative spoken dialogue language modeling with a "next-token-pair prediction" objective, facilitating the LLM's comprehension of natural human conversations. Our pipeline equips the LLM to produce spoken interactions that are more natural and fluid than those generated by baseline approaches, as substantiated by thorough evaluations.
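The "next-token-pair prediction" objective named in the abstract can be illustrated with a minimal sketch. The abstract does not specify the formulation, so everything here is an assumption made for illustration: the two speaker channels are taken to be time-aligned sequences of discrete token ids, the model is taken to emit one logit vector per channel at each step, and the loss is taken to be the sum of the two per-channel cross-entropies. The function names `make_pair_targets` and `next_token_pair_loss` are hypothetical and do not come from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def make_pair_targets(channel_a, channel_b):
    """Zip two time-aligned token streams into (a_t, b_t) pairs.

    Returns (inputs, targets): pairs at steps 0..T-2 and the pairs
    shifted by one step, 1..T-1. Equal-length, time-aligned channels
    are an assumption of this sketch.
    """
    if len(channel_a) != len(channel_b):
        raise ValueError("channels must have equal length")
    pairs = list(zip(channel_a, channel_b))
    return pairs[:-1], pairs[1:]


def next_token_pair_loss(logits_a, logits_b, targets):
    """Mean per-step negative log-likelihood of the target pairs.

    Assumes (hypothetically) that the pair probability factorizes into
    one categorical distribution per channel, so the step loss is the
    sum of two cross-entropy terms.
    """
    if not targets or not (len(logits_a) == len(logits_b) == len(targets)):
        raise ValueError("need one logit vector per channel per target pair")
    total = 0.0
    for la, lb, (ta, tb) in zip(logits_a, logits_b, targets):
        total -= log_softmax(la)[ta] + log_softmax(lb)[tb]
    return total / len(targets)


# Toy dual-channel dialogue with a vocabulary of 4 audio tokens.
speaker_a = [3, 1, 2, 0]
speaker_b = [0, 0, 1, 2]
inputs, targets = make_pair_targets(speaker_a, speaker_b)

# Stand-in for model outputs: uniform logits for both channels.
vocab_size = 4
uniform = [[0.0] * vocab_size for _ in targets]
loss = next_token_pair_loss(uniform, uniform, targets)
```

With uniform logits the loss equals 2 ln 4 (about 2.77), one ln 4 term per channel, which is a convenient sanity check before substituting real model outputs. A joint softmax over all token pairs would be an equally plausible reading of the objective; the factorized form is used here only because it is the simpler one to show.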

Cite

Text

Meng et al. "Parrot: Autoregressive Spoken Dialogue Language Modeling with Decoder-Only Transformers." NeurIPS 2024 Workshops: Audio_Imagination, 2024.

Markdown

[Meng et al. "Parrot: Autoregressive Spoken Dialogue Language Modeling with Decoder-Only Transformers." NeurIPS 2024 Workshops: Audio_Imagination, 2024.](https://mlanthology.org/neuripsw/2024/meng2024neuripsw-parrot/)

BibTeX

@inproceedings{meng2024neuripsw-parrot,
  title     = {{Parrot: Autoregressive Spoken Dialogue Language Modeling with Decoder-Only Transformers}},
  author    = {Meng, Ziqiao and Wang, Qichao and Cui, Wenqian and Zhang, Yifei and Wu, Bingzhe and King, Irwin and Chen, Liang and Zhao, Peilin},
  booktitle = {NeurIPS 2024 Workshops: Audio_Imagination},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/meng2024neuripsw-parrot/}
}