Speech Recognition Datasets for Low-Resource Congolese Languages

Abstract

Large pre-trained Automatic Speech Recognition (ASR) models have begun to perform better on low-resource languages as a result of the availability of data and transfer learning. However, only a small number of languages have sufficient resources to benefit from transfer learning. This paper contributes to expanding speech recognition resources for under-represented languages. We release two new datasets to the research community: the Lingala Read Speech Corpus, consisting of 4 hours of labeled audio clips, and the Congolese Speech Radio Corpus, containing 741 hours of unlabeled audio in 4 major languages spoken in the Democratic Republic of the Congo. Additionally, we obtain state-of-the-art results for Congolese wav2vec2. We observe an average decrease of 2% in word error rate (WER) when a Congolese multilingual pre-trained model is used for fine-tuning on Lingala. Importantly, our study is the first attempt at benchmarking speech recognition systems for Lingala and presents the first-ever multilingual model for 4 Congolese languages spoken by a combined 65 million people. Our data and models will be publicly available, and we hope they help advance research in ASR for low-resource languages.

Cite

Text

Kimanuka et al. "Speech Recognition Datasets for Low-Resource Congolese Languages." ICLR 2023 Workshops: AfricaNLP, 2023.

Markdown

[Kimanuka et al. "Speech Recognition Datasets for Low-Resource Congolese Languages." ICLR 2023 Workshops: AfricaNLP, 2023.](https://mlanthology.org/iclrw/2023/kimanuka2023iclrw-speech/)

BibTeX

@inproceedings{kimanuka2023iclrw-speech,
  title     = {{Speech Recognition Datasets for Low-Resource Congolese Languages}},
  author    = {Kimanuka, Ussen Abre and Maina, Ciira wa and Büyük, Osman},
  booktitle = {ICLR 2023 Workshops: AfricaNLP},
  year      = {2023},
  url       = {https://mlanthology.org/iclrw/2023/kimanuka2023iclrw-speech/}
}