GlycoNMR: Dataset and Benchmark of Carbohydrate-Specific NMR Chemical Shift for Machine Learning Research
Abstract
Molecular representation learning (MRL) is a powerful contribution by machine learning tochemistry as it converts molecules into numerical representations, which is fundamental fordiverse downstream applications, such as property prediction and drug design. While MRLhas had great success with proteins and general biomolecules, it has yet to be exploredfor carbohydrates in the growing fields of glycoscience and glycomaterials (the study anddesign of carbohydrates). This under-exploration can be primarily attributed to the limitedavailability of comprehensive and well-curated carbohydrate-specific datasets and a lackof machine learning (ML) techniques tailored to meet the unique problems presentedby carbohydrate data. Interpreting and annotating carbohydrate data is generally morecomplicated than protein data and requires substantial domain knowledge. In addition,existing MRL methods were predominately optimized for proteins and small biomoleculesand may not be effective for carbohydrate applications without special modifications. Toaddress this challenge, accelerate progress in glycoscience and glycomaterials, and enrichthe data resources of the ML community, we introduce GlycoNMR. GlycoNMR containstwo laboriously curated datasets with 2,609 carbohydrate structures and 211,543 annotatednuclear magnetic resonance (NMR) atomic-level chemical shifts that can be used to trainML models for precise atomic-level prediction. NMR data is one of the most appealingstarting points for developing ML techniques to facilitate glycoscience and glycomaterialsresearch, as NMR is the preeminent technique in carbohydrate structure research, andbiomolecule structure is among the foremost predictors of functions and properties. Wetailored a set of carbohydrate-specific features and adapted existing 3D-based graph neuralnetworks to tackle the problem of predicting NMR shifts effectively. For illustration, webenchmark these modified MRL models on GlycoNMR.
Cite
Text
Chen et al. "GlycoNMR: Dataset and Benchmark of Carbohydrate-Specific NMR Chemical Shift for Machine Learning Research." Data-centric Machine Learning Research, 2024.Markdown
[Chen et al. "GlycoNMR: Dataset and Benchmark of Carbohydrate-Specific NMR Chemical Shift for Machine Learning Research." Data-centric Machine Learning Research, 2024.](https://mlanthology.org/dmlr/2024/chen2024dmlr-glyconmr/)BibTeX
@article{chen2024dmlr-glyconmr,
title = {{GlycoNMR: Dataset and Benchmark of Carbohydrate-Specific NMR Chemical Shift for Machine Learning Research}},
author = {Chen, Zizhang and Badman, Ryan Paul and Foley, Bethany Lachele and Woods, Robert J and Hong, Pengyu},
journal = {Data-centric Machine Learning Research},
year = {2024},
pages = {1-37},
volume = {1},
url = {https://mlanthology.org/dmlr/2024/chen2024dmlr-glyconmr/}
}