ProSST: Protein Language Modeling with Quantized Structure and Disentangled Attention

Abstract

Protein language models (PLMs) have shown remarkable capabilities in various protein function prediction tasks. However, while protein function is intricately tied to structure, most existing PLMs do not incorporate protein structure information. To address this issue, we introduce ProSST, a Transformer-based protein language model that seamlessly integrates both protein sequences and structures. ProSST incorporates a structure quantization module and a Transformer architecture with disentangled attention. The structure quantization module translates a 3D protein structure into a sequence of discrete tokens by first serializing the protein structure into residue-level local structures and then embedding them into a dense vector space. These vectors are then quantized into discrete structure tokens by a pre-trained clustering model, and the resulting tokens serve as an effective protein structure representation. Furthermore, ProSST explicitly learns the relationship between protein residue token sequences and structure token sequences through sequence-structure disentangled attention. We pre-train ProSST on millions of protein structures using a masked language model objective, enabling it to learn comprehensive contextual representations of proteins. To evaluate ProSST, we conduct extensive experiments on zero-shot mutation effect prediction and several supervised downstream tasks, where ProSST achieves state-of-the-art performance, outperforming all baselines. Our code and pre-trained models are publicly available.
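The abstract describes the pipeline only at a high level. The sketch below illustrates the two core ideas under stated assumptions: (1) structure quantization as nearest-centroid assignment of residue-level local-structure embeddings against centroids from a pre-trained clustering model (k-means is assumed here), and (2) a toy sequence-structure disentangled attention score that mixes residue-to-residue, residue-to-structure, and structure-to-residue terms. The function names, tensor shapes, projection matrices, and the exact attention decomposition and scaling are illustrative assumptions, not the authors' implementation.

```python
import torch

def quantize_local_structures(local_struct_emb: torch.Tensor,
                              centroids: torch.Tensor) -> torch.Tensor:
    """Assign each residue-level local-structure embedding to its nearest
    centroid (pre-trained clustering model, k-means assumed), producing a
    sequence of discrete structure tokens.

    local_struct_emb: (L, D) dense embeddings, one per residue.
    centroids:        (K, D) cluster centres learned offline.
    Returns:          (L,)  integer structure tokens in [0, K).
    """
    dists = torch.cdist(local_struct_emb, centroids)  # (L, K) Euclidean distances
    return dists.argmin(dim=-1)                       # nearest-centroid token ids

def disentangled_attention_scores(h_res: torch.Tensor,
                                  h_struct: torch.Tensor,
                                  Wq_r: torch.Tensor, Wk_r: torch.Tensor,
                                  Wq_s: torch.Tensor, Wk_s: torch.Tensor) -> torch.Tensor:
    """Toy sequence-structure disentangled attention: the usual
    residue-to-residue term plus residue-to-structure and structure-to-residue
    cross terms, so residue tokens and structure tokens interact explicitly.

    h_res, h_struct: (L, D) residue and structure token representations.
    W*:              (D, D) query/key projections (hypothetical parameters).
    Returns:         (L, L) pre-softmax attention scores.
    """
    q_r, k_r = h_res @ Wq_r, h_res @ Wk_r          # residue queries / keys
    q_s, k_s = h_struct @ Wq_s, h_struct @ Wk_s    # structure queries / keys
    d = h_res.size(-1)
    # Scaling by sqrt(3d) follows the convention for summing three score terms;
    # the exact normalization in ProSST may differ.
    return (q_r @ k_r.T + q_r @ k_s.T + q_s @ k_r.T) / (3 * d) ** 0.5
```

In a pre-training loop, the structure tokens from `quantize_local_structures` would accompany the (partially masked) residue tokens, and the disentangled scores would replace the standard attention logits before softmax; both details are assumptions for illustration.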

Cite

Text

Li et al. "ProSST: Protein Language Modeling with Quantized Structure and Disentangled Attention." Neural Information Processing Systems, 2024. doi:10.52202/079017-1126

Markdown

[Li et al. "ProSST: Protein Language Modeling with Quantized Structure and Disentangled Attention." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/li2024neurips-prosst/) doi:10.52202/079017-1126

BibTeX

@inproceedings{li2024neurips-prosst,
  title     = {{ProSST: Protein Language Modeling with Quantized Structure and Disentangled Attention}},
  author    = {Li, Mingchen and Tan, Yang and Ma, Xinzhu and Zhong, Bozitao and Yu, Huiqun and Zhou, Ziyi and Ouyang, Wanli and Zhou, Bingxin and Tan, Pan and Hong, Liang},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-1126},
  url       = {https://mlanthology.org/neurips/2024/li2024neurips-prosst/}
}