ALLaM: Large Language Models for Arabic and English
Abstract
In this work, we present ALLaM: Arabic Large Language Model, a series of large language models to support the ecosystem of Arabic Language Technologies (ALT). ALLaM is carefully trained, considering the values of language alignment and transferability of knowledge at scale. The models are based on an autoregressive decoder-only architecture and are pretrained on a mixture of Arabic and English texts. We illustrate how second-language acquisition via vocabulary expansion can steer a language model toward a new language without major catastrophic forgetting in English. Furthermore, we highlight the effectiveness of using translation data and the process of knowledge encoding within the language model's latent space. Finally, we show that effective alignment with human preferences can significantly enhance the performance of a large language model (LLM) compared to less aligned models of a larger scale. Our methodology achieves state-of-the-art performance on various Arabic benchmarks, including MMLU Arabic, ACVA, and Arabic Exams. Our aligned models improve in both Arabic and English over their base models.
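The vocabulary-expansion step mentioned in the abstract can be pictured with a short sketch. The snippet below is an illustrative assumption, not the authors' released code or exact recipe: it uses the Hugging Face transformers API to add new Arabic tokens to an existing English-centric tokenizer and to resize the model's embedding matrix, after which the expanded model would be further pretrained on the Arabic/English mixture. The base checkpoint name and the token list are placeholders.

# Illustrative sketch of second-language acquisition via vocabulary expansion.
# Assumptions: the base checkpoint and the new-token list are placeholders,
# not the actual ALLaM artifacts.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-2-7b-hf"  # hypothetical English-centric base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# New Arabic subword tokens, e.g. learned from an Arabic corpus (placeholder list).
new_arabic_tokens = ["السلام", "عليكم", "اللغة"]
num_added = tokenizer.add_tokens(new_arabic_tokens)

# Grow the embedding (and tied output) matrix to cover the added tokens; the new
# rows are initialized by the library and learned during continued pretraining.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; new vocabulary size = {len(tokenizer)}")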
Cite
Text
Bari et al. "ALLaM: Large Language Models for Arabic and English." International Conference on Learning Representations, 2025.
Markdown
[Bari et al. "ALLaM: Large Language Models for Arabic and English." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/bari2025iclr-allam/)
BibTeX
@inproceedings{bari2025iclr-allam,
title = {{ALLaM: Large Language Models for Arabic and English}},
author = {Bari, M Saiful and Alnumay, Yazeed and Alzahrani, Norah A. and Alotaibi, Nouf M. and Alyahya, Hisham Abdullah and AlRashed, Sultan and Mirza, Faisal Abdulrahman and Alsubaie, Shaykhah Z. and Alahmed, Hassan A. and Alabduljabbar, Ghadah and Alkhathran, Raghad and Almushayqih, Yousef and Alnajim, Raneem and Alsubaihi, Salman and Al Mansour, Maryam and Hassan, Saad Amin and Alrubaian, Majed and Alammari, Ali and Alawami, Zaki and Al-Thubaity, Abdulmohsen and Abdelali, Ahmed and Kuriakose, Jeril and Abujabal, Abdalghani and Al-Twairesh, Nora and Alowisheq, Areeb and Khan, Haidar},
booktitle = {International Conference on Learning Representations},
year = {2025},
url = {https://mlanthology.org/iclr/2025/bari2025iclr-allam/}
}