AGE: Amharic, Ge’ez and English Parallel Dataset

Abstract

African languages are not well represented in Natural Language Processing (NLP), mainly because of a lack of resources for training models. Low-resource languages such as Amharic and Ge’ez cannot benefit from modern NLP methods because high-quality datasets are lacking. This paper presents AGE, an open-source parallel dataset with three-way alignment across Amharic, Ge’ez, and English. Additionally, we introduce 1,000 new Ge’ez-centered sentences sourced from domains such as news and novels. Furthermore, we develop a model from a multilingual pre-trained language model, which achieves scores of 12.29 and 30.66 for English to Ge’ez and Ge’ez to English, respectively, and 9.39 and 12.29 for Amharic to Ge’ez and Ge’ez to Amharic, respectively. Our dataset and models are available at the AGE Dataset repository.

Cite

Text

Ademtew and Birbo. "AGE: Amharic, Ge’ez and English Parallel Dataset." ICLR 2024 Workshops: AfricaNLP, 2024.

Markdown

[Ademtew and Birbo. "AGE: Amharic, Ge’ez and English Parallel Dataset." ICLR 2024 Workshops: AfricaNLP, 2024.](https://mlanthology.org/iclrw/2024/ademtew2024iclrw-age/)

BibTeX

@inproceedings{ademtew2024iclrw-age,
  title     = {{AGE: Amharic, Ge’ez and English Parallel Dataset}},
  author    = {Ademtew, Henok Biadglign and Birbo, Mikiyas Girma},
  booktitle = {ICLR 2024 Workshops: AfricaNLP},
  year      = {2024},
  url       = {https://mlanthology.org/iclrw/2024/ademtew2024iclrw-age/}
}