InstructSpeech: Following Speech Editing Instructions via Large Language Models
Abstract
Instruction-guided speech editing aims to follow the user's natural language instruction to manipulate the semantic and acoustic attributes of speech. In this work, we construct triplet paired data (instruction, input speech, output speech) to alleviate data scarcity and train a multi-task large language model named InstructSpeech. To mitigate the challenges of accurately executing user instructions, we 1) introduce learned task embeddings with a fine-tuned Flan-T5-XL to guide the generation process towards the correct generative task; 2) include an extensive and diverse set of speech editing and processing tasks to enhance model capabilities; 3) investigate chain-of-thought reasoning for free-form semantic content editing; and 4) propose a hierarchical adapter that effectively updates a small portion of parameters for generalization to new tasks. To assess instruction-guided speech editing in greater depth, we introduce a benchmark evaluation with contrastive instruction-speech pre-training (CISP) to test speech quality and the faithfulness of instruction-speech alignment. Experimental results demonstrate that InstructSpeech achieves state-of-the-art results on eleven tasks, for the first time unlocking the ability to edit speech's acoustic and semantic attributes following a user's instruction. Audio samples are available at https://InstructSpeech.github.io
Cite
Text
Huang et al. "InstructSpeech: Following Speech Editing Instructions via Large Language Models." International Conference on Machine Learning, 2024.
Markdown
[Huang et al. "InstructSpeech: Following Speech Editing Instructions via Large Language Models." International Conference on Machine Learning, 2024.](https://mlanthology.org/icml/2024/huang2024icml-instructspeech/)
BibTeX
@inproceedings{huang2024icml-instructspeech,
title = {{InstructSpeech: Following Speech Editing Instructions via Large Language Models}},
author = {Huang, Rongjie and Hu, Ruofan and Wang, Yongqi and Wang, Zehan and Cheng, Xize and Jiang, Ziyue and Ye, Zhenhui and Yang, Dongchao and Liu, Luping and Gao, Peng and Zhao, Zhou},
booktitle = {International Conference on Machine Learning},
year = {2024},
pages = {19886--19903},
volume = {235},
url = {https://mlanthology.org/icml/2024/huang2024icml-instructspeech/}
}