Diffusion Language Models Are Versatile Protein Learners

Abstract

This paper introduces the diffusion protein language model (DPLM), a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences. We first pre-train scalable DPLMs on evolutionary-scale protein sequences within a generative self-supervised discrete diffusion probabilistic framework, which generalizes language modeling for proteins in a principled way. After pre-training, DPLM can generate structurally plausible, novel, and diverse protein sequences unconditionally. We further demonstrate that the proposed diffusion generative pre-training gives DPLM a better understanding of proteins, making it a superior representation learner that can be fine-tuned for various predictive tasks and compares favorably to ESM2. Moreover, DPLM can be tailored to various needs, showcasing its prowess in conditional generation in several ways: (1) conditioning on partial peptide sequences, e.g., generating scaffolds for functional motifs with a high success rate; (2) incorporating other modalities as conditioners, e.g., structure-conditioned generation for inverse folding; and (3) steering sequence generation towards desired properties, e.g., satisfying specified secondary structures, through plug-and-play classifier guidance.

Cite

Text

Wang et al. "Diffusion Language Models Are Versatile Protein Learners." International Conference on Machine Learning, 2024.

Markdown

[Wang et al. "Diffusion Language Models Are Versatile Protein Learners." International Conference on Machine Learning, 2024.](https://mlanthology.org/icml/2024/wang2024icml-diffusion/)

BibTeX

@inproceedings{wang2024icml-diffusion,
  title     = {{Diffusion Language Models Are Versatile Protein Learners}},
  author    = {Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
  booktitle = {International Conference on Machine Learning},
  year      = {2024},
  pages     = {52309--52333},
  volume    = {235},
  url       = {https://mlanthology.org/icml/2024/wang2024icml-diffusion/}
}