Goud.ma: A News Article Dataset for Summarization in Moroccan Darija

Abstract

Moroccan Darija is a vernacular spoken by over 30 million people, primarily in Morocco. Despite its large number of speakers, it remains a low-resource language. In this paper, we introduce Goud.ma: a dataset of over 158k news articles for automatic summarization in code-switched Moroccan Darija. We analyze the dataset and find that it requires a high level of abstractive reasoning. We fine-tune the Arabic-language BERT (AraBERT) and the language models for the Moroccan (DarijaBERT) and Algerian (DziriBERT) national vernaculars for summarization on Goud.ma. The results show that Goud.ma is a challenging summarization benchmark dataset. We release our dataset publicly to encourage more diverse evaluation tasks and thereby improve language modeling in Moroccan Darija.
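
The abstract does not say how abstractiveness was measured. One common proxy for summarization datasets is the share of summary n-grams that never occur in the source article, and the sketch below illustrates that idea only. The function names `ngrams` and `novel_ngram_ratio` and the whitespace tokenization are assumptions made for illustration, not the authors' analysis code.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) found in a list of tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novel_ngram_ratio(article, summary, n=1):
    """Fraction of distinct summary n-grams that do not appear in the article.

    Assumption: plain whitespace tokenization, which is only a rough
    stand-in for a tokenizer suited to code-switched Darija text.
    """
    summary_ngrams = ngrams(summary.split(), n)
    if not summary_ngrams:
        # Summary is shorter than n tokens: nothing to compare.
        return 0.0
    article_ngrams = ngrams(article.split(), n)
    return len(summary_ngrams - article_ngrams) / len(summary_ngrams)


# Toy English example; a real analysis would average over article/summary pairs.
article = "the cat sat on the mat"
summary = "a cat sat"
unigram_novelty = novel_ngram_ratio(article, summary, n=1)
bigram_novelty = novel_ngram_ratio(article, summary, n=2)
```

A higher ratio means the summary reuses less of the article's wording, which is what makes extractive baselines weaker on a dataset.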

Cite

Text

Issam and Mrini. "Goud.ma: A News Article Dataset for Summarization in Moroccan Darija." ICLR 2022 Workshops: AfricaNLP, 2022.

Markdown

[Issam and Mrini. "Goud.ma: A News Article Dataset for Summarization in Moroccan Darija." ICLR 2022 Workshops: AfricaNLP, 2022.](https://mlanthology.org/iclrw/2022/issam2022iclrw-goud/)

BibTeX

@inproceedings{issam2022iclrw-goud,
  title     = {{Goud.ma: A News Article Dataset for Summarization in Moroccan Darija}},
  author    = {Issam, Abderrahmane and Mrini, Khalil},
  booktitle = {ICLR 2022 Workshops: AfricaNLP},
  year      = {2022},
  url       = {https://mlanthology.org/iclrw/2022/issam2022iclrw-goud/}
}