Generating Vocals from Lyrics and Musical Accompaniment
Abstract
In this work, we introduce AutoSing, a novel framework designed to generate diverse and high-quality singing voices from provided lyrics and musical accompaniment. AutoSing extends an existing semantic token-based text-to-speech approach by incorporating musical accompaniment as an additional conditioning input. This enables AutoSing to synchronize its vocal output with the rhythm and melodic nuances of the accompaniment while adhering to the provided lyrics. Our contributions include a novel training scheme for autoregressive audio models applied to singing voice synthesis, as well as ablation studies identifying how best to condition generation on musical accompaniment. We evaluate AutoSing through subjective listening tests, demonstrating its capability to generate coherent and creative singing voices. Furthermore, we open-source our codebase to foster further research in singing voice synthesis.
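Illustrative sketch (not from the paper): the abstract describes conditioning an autoregressive semantic-token model on both lyrics and accompaniment, but the exact mechanism is what the paper's ablations investigate. As a rough, hypothetical illustration of one plausible scheme, prefix conditioning, the PyTorch snippet below prepends lyric and accompaniment tokens to the vocal token stream before causal decoding. The class name (AutoSingSketch), all dimensions, and the shared-vocabulary assumption are ours, not the authors'.

import torch
import torch.nn as nn

class AutoSingSketch(nn.Module):
    """Hypothetical decoder-only model: predicts vocal semantic tokens
    given lyric tokens and accompaniment tokens as a conditioning prefix.
    A single shared token vocabulary is assumed for simplicity."""

    def __init__(self, vocab_size=1024, d_model=512, n_layers=8, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, lyric_tokens, accomp_tokens, vocal_tokens):
        # Condition by concatenating [lyrics | accompaniment | vocals];
        # at inference, only the vocal segment would be generated.
        seq = torch.cat([lyric_tokens, accomp_tokens, vocal_tokens], dim=1)
        x = self.embed(seq)
        # Causal mask: each position attends only to earlier positions.
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1)).to(x.device)
        h = self.backbone(x, mask=mask)
        return self.head(h)  # next-token logits over semantic tokens

# Toy usage with random token IDs (teacher forcing on the vocal stream):
model = AutoSingSketch()
lyrics = torch.randint(0, 1024, (1, 32))   # tokenized lyrics
accomp = torch.randint(0, 1024, (1, 256))  # accompaniment semantic tokens
vocals = torch.randint(0, 1024, (1, 128))  # target vocal tokens
logits = model(lyrics, accomp, vocals)     # shape (1, 416, 1024)

The paper's ablations compare different ways of injecting the accompaniment; this prefix-concatenation variant is only one candidate, shown here to make the conditioning idea concrete.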
Cite
Text
Streich et al. "Generating Vocals from Lyrics and Musical Accompaniment." NeurIPS 2024 Workshops: Audio_Imagination, 2024.

Markdown

[Streich et al. "Generating Vocals from Lyrics and Musical Accompaniment." NeurIPS 2024 Workshops: Audio_Imagination, 2024.](https://mlanthology.org/neuripsw/2024/streich2024neuripsw-generating/)

BibTeX
@inproceedings{streich2024neuripsw-generating,
title = {{Generating Vocals from Lyrics and Musical Accompaniment}},
author = {Streich, Georg and Lanzendörfer, Luca A and Grötschla, Florian and Wattenhofer, Roger},
booktitle = {NeurIPS 2024 Workshops: Audio_Imagination},
year = {2024},
url = {https://mlanthology.org/neuripsw/2024/streich2024neuripsw-generating/}
}