LeMat-Bulk: Aggregating, and De-Duplicating Quantum Chemistry Materials Databases

Abstract

The rapid expansion of materials science databases enables the training of predictive machine learning models that deliver fast, accurate estimates of materials properties, as well as generative models that explore the vast combinatorial space of material candidates. Initiatives like the Materials Project, OQMD, and Alexandria have greatly expanded the scope of computational materials science and fueled progress in the materials science community. However, they have also introduced challenges related to duplication, data integration, and interoperability, which complicate efforts to develop scalable machine learning models. To address these challenges, we introduce LeMat-Bulk, a unified dataset combining Density Functional Theory (DFT) calculations from the Materials Project, OQMD, and Alexandria. This dataset encompasses over 5.3 million materials across three DFT functionals, including the largest repository of PBEsol and SCAN functional calculations ($\sim$500k). Our methodology standardizes DFT calculations across databases with varying parameters, resolving inconsistencies and enhancing cross-compatibility. In addition, we propose and benchmark a hashing function (BAWL), building on Ongari et al. (2022), that generates identifiers for crystalline inorganic materials by capturing their structural and compositional properties.
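
To illustrate the general idea of hash-based de-duplication mentioned above, here is a minimal sketch in Python. It is not the BAWL function, whose construction is not described in this abstract. The helper names (`reduced_formula`, `toy_fingerprint`, `group_duplicates`), the entry layout, and the use of a space-group number as a stand-in for structural information are all assumptions made for illustration only.

```python
import hashlib
from collections import defaultdict
from functools import reduce
from math import gcd


def reduced_formula(composition):
    """Return a canonical reduced formula, e.g. {"Cl": 4, "Na": 4} -> "Cl1Na1".

    Hypothetical helper: elements are sorted alphabetically and counts are
    divided by their greatest common divisor.
    """
    divisor = reduce(gcd, composition.values())
    return "".join(f"{el}{n // divisor}" for el, n in sorted(composition.items()))


def toy_fingerprint(composition, space_group):
    """Hash a composition plus a structural label into a short identifier.

    Assumption: the space-group number stands in for structural information.
    A real materials fingerprint would use a much richer structural descriptor.
    """
    key = f"{reduced_formula(composition)}|{space_group}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def group_duplicates(entries):
    """Group entry ids that share the same fingerprint."""
    groups = defaultdict(list)
    for entry in entries:
        fp = toy_fingerprint(entry["composition"], entry["space_group"])
        groups[fp].append(entry["id"])
    return dict(groups)


# Made-up entries standing in for records from different source databases.
entries = [
    {"id": "source-a-1", "composition": {"Na": 1, "Cl": 1}, "space_group": 225},
    {"id": "source-b-7", "composition": {"Cl": 4, "Na": 4}, "space_group": 225},
    {"id": "source-c-3", "composition": {"Si": 2}, "space_group": 227},
]
groups = group_duplicates(entries)
```

In this toy setting the first two entries collapse into one group because they share a reduced formula and structural label, while the third stays separate. The design point is that identical identifiers can be compared in constant time, so candidate duplicates across databases are found by grouping rather than by pairwise structure comparison.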

Cite

Text

Siron et al. "LeMat-Bulk: Aggregating, and De-Duplicating Quantum Chemistry Materials Databases." ICLR 2025 Workshops: AI4MAT, 2025.

Markdown

[Siron et al. "LeMat-Bulk: Aggregating, and De-Duplicating Quantum Chemistry Materials Databases." ICLR 2025 Workshops: AI4MAT, 2025.](https://mlanthology.org/iclrw/2025/siron2025iclrw-lematbulk/)

BibTeX

@inproceedings{siron2025iclrw-lematbulk,
  title     = {{LeMat-Bulk: Aggregating, and De-Duplicating Quantum Chemistry Materials Databases}},
  author    = {Siron, Martin and Djafar, Inel and du Fayet, Etienne and Rossello, Amandine and Ramlaoui, Ali and Duval, Alexandre},
  booktitle = {ICLR 2025 Workshops: AI4MAT},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/siron2025iclrw-lematbulk/}
}