Structure-Based Synthetic Data Augmentation for Protein Language Models

Abstract

The goal of $\textit{de novo}$ protein design is to leverage principles learned from natural proteins to design entirely new ones. Deep generative models of protein structure and of protein sequence are the two dominant $\textit{de novo}$ design paradigms. Structure-based models can produce highly novel proteins but are constrained by their training data to a narrow range of topologies. Sequence-based design models produce more natural samples over a wider range of topologies, but with reduced novelty. Here, we propose a structure-based synthetic data augmentation approach that combines the benefits of structure- and sequence-based generative models of proteins. We generated and characterized 240,830 $\textit{de novo}$ backbone structures and used these backbones to generate 45 million sequences for data augmentation. Models trained with structure-based synthetic data augmentation generate a shifted distribution of proteins that are more likely to express successfully in $\textit{E. coli}$ and are more thermostable. We release the trained models as well as our complete synthetic dataset, BackboneRef.
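The core augmentation step described in the abstract, mixing structure-derived synthetic sequences into a natural-sequence training corpus, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the mixing ratio, and the sequence placeholders are all assumptions.

```python
import random

def build_augmented_corpus(natural, synthetic, synthetic_fraction=0.5, seed=0):
    """Return a shuffled training corpus in which roughly
    `synthetic_fraction` of entries come from the synthetic pool
    (capped by the pool's size). Illustrative only; the paper's
    actual mixing strategy may differ."""
    if not 0.0 <= synthetic_fraction < 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Number of synthetic sequences needed so that they make up
    # `synthetic_fraction` of the combined corpus.
    k = round(len(natural) * synthetic_fraction / (1.0 - synthetic_fraction))
    picked = rng.sample(synthetic, min(k, len(synthetic)))
    corpus = list(natural) + picked
    rng.shuffle(corpus)
    return corpus

# Toy usage with placeholder sequence IDs (hypothetical data).
natural = [f"NAT{i}" for i in range(8)]
synthetic = [f"SYN{i}" for i in range(20)]
corpus = build_augmented_corpus(natural, synthetic, synthetic_fraction=0.5)
print(len(corpus))  # 16: 8 natural + 8 synthetic
```

In practice, the synthetic pool here would correspond to sequences designed onto the $\textit{de novo}$ backbones (the paper's 45 million augmentation sequences), and the combined corpus would feed standard protein language model training.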

Cite

Text

Lee et al. "Structure-Based Synthetic Data Augmentation for Protein Language Models." ICLR 2025 Workshops: GEM, 2025.

Markdown

[Lee et al. "Structure-Based Synthetic Data Augmentation for Protein Language Models." ICLR 2025 Workshops: GEM, 2025.](https://mlanthology.org/iclrw/2025/lee2025iclrw-structurebased/)

BibTeX

@inproceedings{lee2025iclrw-structurebased,
  title     = {{Structure-Based Synthetic Data Augmentation for Protein Language Models}},
  author    = {Lee, Alex Jihun and Amini, Ava P and Yang, Kevin K and Alamdari, Sarah and Wang, Chentong and Abbasi-Asl, Reza},
  booktitle = {ICLR 2025 Workshops: GEM},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/lee2025iclrw-structurebased/}
}