ProteinShake: Building Datasets and Benchmarks for Deep Learning on Protein Structures

Abstract

We present ProteinShake, a Python software package that simplifies dataset creation and model evaluation for deep learning on protein structures. Users can create custom datasets or load an extensive set of pre-processed datasets from the Protein Data Bank (PDB) and AlphaFoldDB. Each dataset is associated with prediction tasks and evaluation functions covering a broad array of biological challenges. A benchmark on these tasks shows that pre-training almost always improves performance, the optimal data modality (graphs, voxel grids, or point clouds) is task-dependent, and models struggle to generalize to new structures. ProteinShake makes protein structure data easily accessible and comparison among models straightforward, providing challenging benchmark settings with real-world implications. ProteinShake is available at: https://proteinshake.ai

Cite

Text

Kucera et al. "ProteinShake: Building Datasets and Benchmarks for Deep Learning on Protein Structures." Neural Information Processing Systems, 2023.

Markdown

[Kucera et al. "ProteinShake: Building Datasets and Benchmarks for Deep Learning on Protein Structures." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/kucera2023neurips-proteinshake/)

BibTeX

@inproceedings{kucera2023neurips-proteinshake,
  title     = {{ProteinShake: Building Datasets and Benchmarks for Deep Learning on Protein Structures}},
  author    = {Kucera, Tim and Oliver, Carlos and Chen, Dexiong and Borgwardt, Karsten},
  booktitle = {Neural Information Processing Systems},
  year      = {2023},
  url       = {https://mlanthology.org/neurips/2023/kucera2023neurips-proteinshake/}
}