DRAW: Domain Weight Randomization with Bayesian Updating for LLM Pre-Training

Wang, Ruonan; Qiao, Yongqi; Xie, Zhonglin; Yuan, Kun

TMLR 2026

Abstract

An optimal pre-training data mixture is pivotal for large language model (LLM) performance, but searching for the best domain weights is computationally expensive. We present Domain Weight Randomization with Bayesian Updating (DRAW), a principled framework that treats domain weights as Dirichlet-distributed random variables whose parameters scale with model width. Informative priors are first estimated using proxy models; the main model then refines these priors through Bayesian inference and parameter scaling, dynamically sampling domain weights during training. Theoretically, DRAW reduces generalization error at a rate of $\mathcal{O}(1/\sqrt{n})$ as model width increases, ensuring stable convergence. Empirical results on open-domain corpora and diverse benchmarks show that DRAW reliably outperforms fixed and adaptive baselines in both language modeling and downstream tasks, achieving better average and worst-case performance as well as strong robustness. DRAW not only highlights valuable data domains while suppressing noisy ones, but also introduces a scalable and effective mechanism for adaptive data mixing in LLM pre-training, facilitating efficient knowledge transfer from proxy models to large models.
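
The general idea of sampling domain weights from a Dirichlet distribution and refining its parameters during training can be illustrated with a minimal sketch. The abstract does not state the update rule or the exact width scaling, so everything below is an assumption made for illustration: the helper names, the linear scaling of the concentration with width, the `base_strength` constant, and the pseudo-count update driven by placeholder per-domain scores are all hypothetical and are not the paper's method.

```python
import numpy as np


def scale_concentration(prior_mean, width, base_width, base_strength=10.0):
    """Hypothetical scaling: concentration grows linearly with model width.

    A larger concentration makes sampled weights cluster more tightly
    around the prior mean estimated from proxy models.
    """
    prior_mean = np.asarray(prior_mean, dtype=float)
    return prior_mean * base_strength * (width / base_width)


def sample_domain_weights(alpha, rng):
    """Draw one domain-weight vector (non-negative, sums to 1) from Dirichlet(alpha)."""
    return rng.dirichlet(alpha)


def update_concentration(alpha, domain_scores, step=1.0):
    """Hypothetical pseudo-count update.

    Adds evidence proportional to each domain's normalized score.
    Scores are assumed to be positive; how they would be measured
    is left open here.
    """
    scores = np.asarray(domain_scores, dtype=float)
    evidence = scores / scores.sum()
    return alpha + step * evidence


rng = np.random.default_rng(0)
prior_mean = [0.5, 0.3, 0.2]  # illustrative proxy-model estimate for 3 domains
alpha = scale_concentration(prior_mean, width=1024, base_width=256)

for _ in range(3):
    weights = sample_domain_weights(alpha, rng)  # mixture for the next training chunk
    placeholder_scores = [0.2, 0.5, 0.3]         # stand-in for a per-domain signal
    alpha = update_concentration(alpha, placeholder_scores)
```

In this sketch a wider model starts with a larger concentration, so its sampled mixtures vary less from step to step, while domains that keep receiving higher scores gradually gain weight in expectation.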

Cite

Text

Wang et al. "DRAW: Domain Weight Randomization with Bayesian Updating for LLM Pre-Training." Transactions on Machine Learning Research, 2026.

Markdown

[Wang et al. "DRAW: Domain Weight Randomization with Bayesian Updating for LLM Pre-Training." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/wang2026tmlr-draw/)

BibTeX

@article{wang2026tmlr-draw,
  title     = {{DRAW: Domain Weight Randomization with Bayesian Updating for LLM Pre-Training}},
  author    = {Wang, Ruonan and Qiao, Yongqi and Xie, Zhonglin and Yuan, Kun},
  journal   = {Transactions on Machine Learning Research},
  year      = {2026},
  url       = {https://mlanthology.org/tmlr/2026/wang2026tmlr-draw/}
}