Enhancing Multilingual LLM Pretraining with Model-Based Data Selection

Abstract

Dataset curation has become a basis for strong large language model (LLM) performance. While various rule-based filtering heuristics exist for English and multilingual datasets, model-based filtering techniques have primarily focused on English. To address the disparity stemming from limited research on non-English languages, we propose a model-based filtering framework for multilingual datasets that aims to identify a diverse set of structured and knowledge-rich samples. Our approach emphasizes transparency, simplicity, and efficiency, leveraging Transformer- and FastText-based classifiers to ensure the broad accessibility of our technique and data. We conduct comprehensive ablation studies on the FineWeb-2 web crawl dataset across diverse language families, scripts, and resource availability to demonstrate the effectiveness of our method. Using a 1B-parameter Llama model trained on 70B and 119B tokens, our approach can match the baseline MMLU score with as little as 15% of the training tokens, while also improving across other benchmarks. These findings provide strong evidence for the generalizability of our approach to other languages. As a result, we extend our framework to 20 languages for which we will release the refined pretraining datasets.
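
As a rough illustration of what model-based selection means in general, the sketch below scores each document with a quality function and keeps only the highest-scoring fraction. It is not the authors' implementation: `select_top_fraction`, `toy_score`, and the `keep_fraction` value are hypothetical names and settings chosen for this example, and `toy_score` is a trivial stand-in for a trained classifier such as the FastText- or Transformer-based ones the abstract mentions.

```python
from typing import Callable, List, Tuple


def select_top_fraction(
    docs: List[str],
    score_fn: Callable[[str], float],
    keep_fraction: float = 0.1,
) -> List[Tuple[float, str]]:
    """Return the highest-scoring `keep_fraction` of docs, best first."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Score every document once, then sort by score in descending order.
    scored = sorted(
        ((score_fn(d), d) for d in docs),
        key=lambda pair: pair[0],
        reverse=True,
    )
    # Keep at least one document whenever the input is non-empty.
    n_keep = max(1, int(len(scored) * keep_fraction)) if scored else 0
    return scored[:n_keep]


def toy_score(doc: str) -> float:
    """Hypothetical stand-in for a trained quality classifier.

    Returns the fraction of words longer than six characters.
    """
    words = doc.split()
    return sum(len(w) > 6 for w in words) / len(words) if words else 0.0


if __name__ == "__main__":
    corpus = [
        "a b c",
        "photosynthesis converts sunlight",
        "click here now",
        "mitochondria produce energy molecules",
    ]
    for score, doc in select_top_fraction(corpus, toy_score, keep_fraction=0.5):
        print(f"{score:.2f}  {doc}")
```

In a real pipeline, `score_fn` would wrap a classifier trained per language, and the retained fraction would be tuned against downstream benchmarks rather than fixed in advance.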

Cite

Text

Messmer et al. "Enhancing Multilingual LLM Pretraining with Model-Based Data Selection." ICLR 2025 Workshops: Data_Problems, 2025.

Markdown

[Messmer et al. "Enhancing Multilingual LLM Pretraining with Model-Based Data Selection." ICLR 2025 Workshops: Data_Problems, 2025.](https://mlanthology.org/iclrw/2025/messmer2025iclrw-enhancing/)

BibTeX

@inproceedings{messmer2025iclrw-enhancing,
  title     = {{Enhancing Multilingual LLM Pretraining with Model-Based Data Selection}},
  author    = {Messmer, Bettina and Sabolčec, Vinko and Jaggi, Martin},
  booktitle = {ICLR 2025 Workshops: Data_Problems},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/messmer2025iclrw-enhancing/}
}