On the Importance of Pretraining Data Alignment for Atomic Property Prediction
Abstract
This paper challenges the recent paradigm in atomic property prediction that links progress to growing dataset sizes and computational resources. We show that pretraining on a carefully selected task-aligned dataset can match or even surpass large-scale joint pretraining while using only 1/24th of the pretraining budget. We introduce the Chemical Similarity Index (CSI), a simple metric for molecular graphs inspired by the Fréchet Inception Distance in computer vision, which quantifies the alignment between upstream pretraining datasets and downstream tasks. By selecting the most aligned dataset with minimal CSI distance, we show that models pretrained on a smaller, focused dataset consistently achieve better performance on downstream tasks than those pretrained on massive, mixed datasets such as JMP. This holds even when the mixed dataset includes the upstream dataset most aligned with the downstream task. Counterintuitively, we also find that indiscriminately adding more data can degrade model performance when the additional data is poorly aligned with the target task. Our findings highlight that quality often outperforms quantity in pretraining for atomic property prediction.
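As a rough illustration of the selection step described in the abstract, the sketch below computes a Fréchet-style distance between two sets of feature vectors and picks the upstream dataset closest to the downstream task. It is not the paper's CSI implementation: the abstract does not specify how molecular graphs are turned into features, so the feature matrices, the function names `frechet_distance` and `most_aligned`, and the dataset names are all hypothetical, and the Gaussian form of the distance is simply borrowed from the Fréchet Inception Distance.

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two feature sets.

    Each input is an (n_samples, n_features) array. How the features are
    extracted from molecular graphs is an assumption left outside this sketch.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))

    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative in exact
    # arithmetic; clip small negative values caused by round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


def most_aligned(upstream: dict, downstream: np.ndarray) -> str:
    """Return the name of the upstream feature set with the smallest distance."""
    return min(upstream, key=lambda name: frechet_distance(upstream[name], downstream))


# Toy usage with synthetic features (hypothetical dataset names).
rng = np.random.default_rng(0)
downstream = rng.normal(size=(500, 4))
upstream = {
    "upstream_a": downstream + 0.1,  # nearly the same distribution
    "upstream_b": downstream + 2.0,  # shifted far from the downstream task
}
best = most_aligned(upstream, downstream)
```

In this toy setup the two candidates differ from the downstream features only by a constant shift, so the distance reduces to the squared norm of that shift and the less shifted set is selected.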
Cite
Text
Ghunaim et al. "On the Importance of Pretraining Data Alignment for Atomic Property Prediction." Transactions on Machine Learning Research, 2026.
Markdown
[Ghunaim et al. "On the Importance of Pretraining Data Alignment for Atomic Property Prediction." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/ghunaim2026tmlr-importance/)
BibTeX
@article{ghunaim2026tmlr-importance,
title = {{On the Importance of Pretraining Data Alignment for Atomic Property Prediction}},
author = {Ghunaim, Yasir M. and Hammoud, Hasan Abed Al Kader and Ghanem, Bernard},
journal = {Transactions on Machine Learning Research},
year = {2026},
url = {https://mlanthology.org/tmlr/2026/ghunaim2026tmlr-importance/}
}