ORDerly: Datasets and Benchmarks for Chemical Reaction Data

Abstract

Machine learning has the potential to provide tremendous value to the life sciences through models that aid the discovery of new molecules and reduce the time it takes new products to reach the market. Chemical reactions play a significant role in these fields, but there is a lack of high-quality open-source chemical reaction datasets for training ML models. Herein, we present ORDerly, an open-source Python package for customizable and reproducible preparation of reaction data stored in accordance with the increasingly popular Open Reaction Database (ORD) schema. We use ORDerly to clean US patent data stored in ORD and generate datasets for forward prediction and retrosynthesis, as well as the first benchmark for reaction condition prediction. We train neural networks for condition prediction on datasets generated with ORDerly and show that datasets missing key cleaning steps can lead to silently inflated performance metrics. Additionally, we train transformers for forward and retrosynthesis prediction and demonstrate how non-patent data can be used to evaluate model generalisation. By providing a customizable open-source solution for cleaning and preparing large chemical reaction datasets, ORDerly is poised to push forward the boundaries of machine learning applications in chemistry.

Cite

Text

Wigh et al. "ORDerly: Datasets and Benchmarks for Chemical Reaction Data." NeurIPS 2023 Workshops: AI4Science, 2023.

Markdown

[Wigh et al. "ORDerly: Datasets and Benchmarks for Chemical Reaction Data." NeurIPS 2023 Workshops: AI4Science, 2023.](https://mlanthology.org/neuripsw/2023/wigh2023neuripsw-orderly/)

BibTeX

@inproceedings{wigh2023neuripsw-orderly,
  title     = {{ORDerly: Datasets and Benchmarks for Chemical Reaction Data}},
  author    = {Wigh, Daniel and Arrowsmith, Joe and Pomberger, Alexander and Felton, Kobi and Lapkin, Alexei},
  booktitle = {NeurIPS 2023 Workshops: AI4Science},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/wigh2023neuripsw-orderly/}
}