MuRating: A High Quality Data Selecting Approach to Multilingual Large Language Model Pretraining

Abstract

Data quality is a critical driver of large language model performance, yet existing model-based selection methods focus almost exclusively on English, neglecting other languages that are essential in the training mix for multilingual LLMs. We introduce MuRating, a scalable framework that transfers high-quality English data-quality signals into a multilingual autorater capable of handling 17 languages. MuRating aggregates multiple English autoraters via pairwise comparisons to learn unified document quality scores, then projects these judgments through translation to train a multilingual evaluator on monolingual, cross-lingual, and parallel text pairs. Applied to web data, MuRating selects balanced subsets of English and multilingual content to pretrain LLaMA-architecture models of 1.2B and 7B parameters. Compared to strong baselines, including QuRater, FineWeb2-HQ, AskLLM, and DCLM, our approach increases average accuracy on both English benchmarks and multilingual evaluations. Extensive analyses further validate that pairwise training provides greater stability and robustness than pointwise scoring, underscoring the effectiveness of MuRating as a general multilingual data-selection framework.
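
The aggregation step, in which several raters are combined through pairwise comparisons into one score per document, can be illustrated with a minimal sketch. The abstract does not specify the objective, so everything here is an assumption made for illustration: a majority vote over rater preferences, a Bradley-Terry model fitted by plain gradient descent, and the hypothetical helper names `majority_preferences` and `fit_pairwise_scores`. It is not the authors' implementation.

```python
import math


def majority_preferences(rater_scores):
    """Turn several raters' pointwise scores into majority-vote (winner, loser) pairs."""
    num_docs = len(rater_scores[0])
    pairs = []
    for i in range(num_docs):
        for j in range(i + 1, num_docs):
            votes_i = sum(r[i] > r[j] for r in rater_scores)
            votes_j = sum(r[j] > r[i] for r in rater_scores)
            if votes_i > votes_j:
                pairs.append((i, j))
            elif votes_j > votes_i:
                pairs.append((j, i))
            # Exact ties between raters produce no training pair.
    return pairs


def fit_pairwise_scores(num_docs, comparisons, lr=0.1, epochs=500):
    """Fit one scalar score per document from (winner, loser) pairs.

    Assumption: a Bradley-Terry model, P(i beats j) = sigmoid(s_i - s_j),
    trained by gradient descent on the mean negative log-likelihood.
    """
    scores = [0.0] * num_docs
    for _ in range(epochs):
        grads = [0.0] * num_docs
        for winner, loser in comparisons:
            p_win = 1.0 / (1.0 + math.exp(-(scores[winner] - scores[loser])))
            # Gradient of -log(p_win) with respect to the two scores.
            grads[winner] -= 1.0 - p_win
            grads[loser] += 1.0 - p_win
        scores = [s - lr * g / len(comparisons) for s, g in zip(scores, grads)]
    return scores


# Three hypothetical raters scoring four documents on incompatible scales.
rater_scores = [
    [0.9, 0.4, 0.7, 0.1],
    [4.0, 2.0, 5.0, 1.0],
    [80, 35, 60, 20],
]
pairs = majority_preferences(rater_scores)
scores = fit_pairwise_scores(4, pairs)
ranking = sorted(range(4), key=lambda d: scores[d], reverse=True)
print(ranking)
```

Because only the ordering within each pair is used, the raters' differing score scales never need to be calibrated against each other, which is one plausible reason a pairwise formulation is more robust than averaging pointwise scores.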

Cite

Text

Chen et al. "MuRating: A High Quality Data Selecting Approach to Multilingual Large Language Model Pretraining." Advances in Neural Information Processing Systems, 2025.

Markdown

[Chen et al. "MuRating: A High Quality Data Selecting Approach to Multilingual Large Language Model Pretraining." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/chen2025neurips-murating/)

BibTeX

@inproceedings{chen2025neurips-murating,
  title     = {{MuRating: A High Quality Data Selecting Approach to Multilingual Large Language Model Pretraining}},
  author    = {Chen, Zhixun and Guo, Ping and Han, Wenhan and Zhang, Yifan and {Binbinliu} and Lin, Haobin and Liu, Fengze and Zhao, Yan and Zhang, Bingni and Wang, Taifeng and Zheng, Yin and Cohn, Trevor and Fang, Meng},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/chen2025neurips-murating/}
}