IndicVoices-R: Unlocking a Massive Multilingual Multi-Speaker Speech Corpus for Scaling Indian TTS
Abstract
Recent advancements in text-to-speech (TTS) synthesis show that large-scale models trained with extensive web data produce highly natural-sounding output. However, such data is scarce for Indian languages due to the lack of high-quality, manually subtitled data on platforms like LibriVox or YouTube. To address this gap, we enhance existing large-scale ASR datasets containing natural conversations collected in low-quality environments to generate high-quality TTS training data. Our pipeline leverages the cross-lingual generalization of denoising and speech enhancement models trained on English and applied to Indian languages. This results in IndicVoices-R (IV-R), the largest multilingual Indian TTS dataset derived from an ASR dataset, with 1,704 hours of high-quality speech from 10,496 speakers across 22 Indian languages. IV-R matches the quality of gold-standard TTS datasets like LJSpeech, LibriTTS, and IndicTTS. We also introduce the IV-R Benchmark, the first to assess zero-shot, few-shot, and many-shot speaker generalization capabilities of TTS models on Indian voices, ensuring diversity in age, gender, and style. We demonstrate that fine-tuning an English pre-trained model on a combined dataset of high-quality IndicTTS and our IV-R dataset results in better zero-shot speaker generalization compared to fine-tuning on the IndicTTS dataset alone. Further, our evaluation reveals limited zero-shot generalization for Indian voices in TTS models trained on prior datasets, which we improve by fine-tuning the model on our data containing a diverse set of speakers across language families. We open-source code and data for all 22 official Indian languages.
Cite
Text
Sankar et al. "IndicVoices-R: Unlocking a Massive Multilingual Multi-Speaker Speech Corpus for Scaling Indian TTS." Neural Information Processing Systems, 2024. doi:10.52202/079017-2176
Markdown
[Sankar et al. "IndicVoices-R: Unlocking a Massive Multilingual Multi-Speaker Speech Corpus for Scaling Indian TTS." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/sankar2024neurips-indicvoicesr/) doi:10.52202/079017-2176
BibTeX
@inproceedings{sankar2024neurips-indicvoicesr,
title = {{IndicVoices-R: Unlocking a Massive Multilingual Multi-Speaker Speech Corpus for Scaling Indian TTS}},
author = {Sankar, Ashwin and Anand, Srija and Varadhan, Praveen Srinivasa and Thomas, Sherry and Singal, Mehak and Kumar, Shridhar and Mehendale, Deovrat and Krishana, Aditi and Raju, Giri and Khapra, Mitesh},
booktitle = {Neural Information Processing Systems},
year = {2024},
doi = {10.52202/079017-2176},
url = {https://mlanthology.org/neurips/2024/sankar2024neurips-indicvoicesr/}
}