NGLUEni: Benchmarking and Adapting Pretrained Language Models for Nguni Languages

Abstract

The Nguni languages have over 20 million home-language speakers in South Africa. Datasets for the Nguni languages have grown considerably, but no analysis of NLP model performance has been reported across all of these languages and tasks. In this paper we study pretrained language models (PLMs) for the four Nguni languages: isiXhosa, isiZulu, isiNdebele, and Siswati. We compile all publicly available datasets for natural language understanding and generation, spanning 6 tasks and 11 datasets. This benchmark, which we call NGLUEni, is the first centralised evaluation suite for the Nguni languages, allowing us to systematically evaluate the Nguni-language capabilities of PLMs. Besides evaluating existing PLMs, we develop new PLMs for the Nguni languages through multilingual adaptive finetuning. Our models, Nguni-XLMR and Nguni-ByT5, outperform both their base models and large-scale adapted models, showing that performance gains are obtainable through limited, language-group-based adaptation. We also perform experiments on cross-lingual transfer and machine translation. Our models achieve notable cross-lingual transfer improvements in the lower-resourced Nguni languages (isiNdebele and Siswati). To facilitate future use of NGLUEni as a standardised evaluation suite for the Nguni languages, we create a web portal for accessing the collection of datasets and publicly release our models.

Cite

Text

Meyer et al. "NGLUEni: Benchmarking and Adapting Pretrained Language Models for Nguni Languages." ICLR 2024 Workshops: AfricaNLP, 2024.

Markdown

[Meyer et al. "NGLUEni: Benchmarking and Adapting Pretrained Language Models for Nguni Languages." ICLR 2024 Workshops: AfricaNLP, 2024.](https://mlanthology.org/iclrw/2024/meyer2024iclrw-nglueni/)

BibTeX

@inproceedings{meyer2024iclrw-nglueni,
  title     = {{NGLUEni: Benchmarking and Adapting Pretrained Language Models for Nguni Languages}},
  author    = {Meyer, Francois and Song, Haiyue and Chakrabarty, Abhisek and Buys, Jan and Dabre, Raj and Tanaka, Hideki},
  booktitle = {ICLR 2024 Workshops: AfricaNLP},
  year      = {2024},
  url       = {https://mlanthology.org/iclrw/2024/meyer2024iclrw-nglueni/}
}