FoldToken: Learning Protein Language via Vector Quantization and Beyond

Abstract

Is there a foreign language describing protein sequences and structures simultaneously? Protein structures, represented by continuous 3D points, have long posed a challenge because their modeling paradigm contrasts with that of discrete sequences. We introduce FoldTokenizer to represent protein sequence-structure as discrete symbols. This approach projects residue types and structures into a discrete space, guided by a reconstruction loss for information preservation. We name the learned discrete symbols FoldTokens, and a sequence of FoldTokens serves as a new protein language, transforming protein sequence-structure into a unified modality. We apply the created protein language to the general backbone inpainting task, building the first GPT-style model (FoldGPT) for sequence-structure co-generation with promising results. Key to our success is a substantial enhancement of the vector quantization module, Soft Conditional Vector Quantization (SoftCVQ).
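The core idea of the quantization step can be illustrated with a minimal sketch: continuous encoder outputs are softly assigned to a learnable codebook, and the argmax assignment yields the discrete token ids. This is an illustrative soft vector quantization in NumPy under assumed shapes, not the paper's exact SoftCVQ module (which additionally conditions the codebook attention); the function and variable names are hypothetical.

```python
import numpy as np

def soft_vector_quantize(z, codebook, temperature=1.0):
    """Softly quantize embeddings z (N, D) against a codebook (K, D).

    Each row of z is mapped to a convex combination of codebook vectors,
    weighted by a softmax over negative squared distances; the argmin
    distance gives a discrete token id per row. Illustrative sketch only,
    not the paper's SoftCVQ.
    """
    # Squared L2 distance between each embedding and each code: (N, K)
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    # Soft assignment weights over the codebook (rows sum to 1)
    w = np.exp(-d2 / temperature)
    w /= w.sum(axis=1, keepdims=True)
    z_q = w @ codebook            # soft-quantized embeddings, (N, D)
    tokens = d2.argmin(axis=1)    # discrete token ids, (N,)
    return z_q, tokens

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))  # K=16 codes, D=8 dims (toy sizes)
z = rng.normal(size=(4, 8))          # e.g. 4 residue embeddings
z_q, tokens = soft_vector_quantize(z, codebook, temperature=0.5)
```

At low temperature the soft assignment approaches the hard one-hot lookup of classical VQ-VAE, which is what makes the discrete token ids usable as a "language" for a downstream autoregressive model like FoldGPT.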

Cite

Text

Gao et al. "FoldToken: Learning Protein Language via Vector Quantization and Beyond." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I1.31998

Markdown

[Gao et al. "FoldToken: Learning Protein Language via Vector Quantization and Beyond." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/gao2025aaai-foldtoken/) doi:10.1609/AAAI.V39I1.31998

BibTeX

@inproceedings{gao2025aaai-foldtoken,
  title     = {{FoldToken: Learning Protein Language via Vector Quantization and Beyond}},
  author    = {Gao, Zhangyang and Tan, Cheng and Wang, Jue and Huang, Yufei and Wu, Lirong and Li, Stan Z.},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {219-227},
  doi       = {10.1609/AAAI.V39I1.31998},
  url       = {https://mlanthology.org/aaai/2025/gao2025aaai-foldtoken/}
}