Structure-Based Synthetic Data Augmentation for Protein Language Models
Abstract
The goal of $\textit{de novo}$ protein design is to leverage natural proteins to design new ones. Deep generative models of protein structure and sequence are the two dominant $\textit{de novo}$ design paradigms. Structure-based models can produce highly novel proteins, but their training data constrains them to a narrow range of topologies. Sequence-based design models produce more natural samples across a wider range of topologies, but with reduced novelty. Here, we propose a structure-based synthetic data augmentation approach that combines the benefits of structure and sequence in generative models of proteins. We generated and characterized 240,830 $\textit{de novo}$ backbone structures and used these backbones to generate 45 million sequences for data augmentation. Models trained with structure-based synthetic data augmentation generate a shifted distribution of proteins that are more likely to express successfully in $\textit{E. coli}$ and are more thermostable. We release the trained models as well as our complete synthetic dataset, BackboneRef.
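The augmentation recipe the abstract describes, mixing natural training sequences with sequences derived from synthetic backbones, can be sketched as below. This is a minimal illustration, not the paper's actual pipeline: the function name, the mixing ratio, and the toy sequences are all assumptions for the sake of the example.

```python
import random

def build_augmented_dataset(natural_seqs, synthetic_seqs,
                            synth_fraction=0.5, seed=0):
    """Mix natural sequences with structure-derived synthetic sequences.

    synth_fraction is the target share of synthetic sequences in the
    final pool (an illustrative knob; the paper's actual mixing scheme
    may differ).
    """
    rng = random.Random(seed)
    # Number of synthetic sequences needed so that they make up
    # synth_fraction of the combined pool.
    n_synth = int(len(natural_seqs) * synth_fraction / (1.0 - synth_fraction))
    n_synth = min(n_synth, len(synthetic_seqs))
    pool = list(natural_seqs) + rng.sample(list(synthetic_seqs), n_synth)
    rng.shuffle(pool)
    return pool

# Toy example: hypothetical natural and synthetic amino-acid sequences.
natural = ["MKTAYIAK", "MGSSHHHH", "MALWMRLL"]
synthetic = ["MAQVKLDE", "MSTNPKPQ", "MEELFKKH", "MDVFMKGL"]
augmented = build_augmented_dataset(natural, synthetic, synth_fraction=0.5)
```

A language model would then be trained on `augmented` exactly as on a purely natural corpus; the augmentation only changes the data distribution, not the training procedure.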
Cite
Text
Lee et al. "Structure-Based Synthetic Data Augmentation for Protein Language Models." ICLR 2025 Workshops: GEM, 2025.
Markdown
[Lee et al. "Structure-Based Synthetic Data Augmentation for Protein Language Models." ICLR 2025 Workshops: GEM, 2025.](https://mlanthology.org/iclrw/2025/lee2025iclrw-structurebased/)
BibTeX
@inproceedings{lee2025iclrw-structurebased,
title = {{Structure-Based Synthetic Data Augmentation for Protein Language Models}},
author = {Lee, Alex Jihun and Amini, Ava P and Yang, Kevin K and Alamdari, Sarah and Wang, Chentong and Abbasi-Asl, Reza},
booktitle = {ICLR 2025 Workshops: GEM},
year = {2025},
url = {https://mlanthology.org/iclrw/2025/lee2025iclrw-structurebased/}
}