Distilling Structural Representations into Protein Sequence Models

Abstract

Protein language (or sequence) models, such as the popular ESM2, are widely used tools for extracting evolution-based protein representations and have achieved significant success on core downstream biological tasks. A major open problem is how to obtain representations that best capture both the evolutionary history encoded in a protein's sequence and its atomic-level structural properties. We introduce the **I**mplicit **S**equence **M**odel (ISM), a sequence-only input model with structurally-enriched representations that outperforms state-of-the-art sequence models on several well-studied benchmarks, including mutation stability assessment and structure prediction. Our key innovations are a microenvironment-based autoencoder for generating structure tokens and a self-supervised training objective that distills these tokens into the pre-trained ESM2 model. Notably, we make ISM's structure-enriched weights easily accessible to any application built on the ESM2 framework.
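Because the abstract states that ISM's weights are drop-in compatible with the ESM2 framework, the following is a minimal sketch of how one might extract its structure-enriched per-residue embeddings using the HuggingFace ESM classes. This is not the authors' released code: the checkpoint path is a placeholder (the real checkpoint name is not given here), and only the standard `transformers` ESM2 API is assumed.

```python
# Minimal sketch: loading ESM2-compatible, structure-enriched weights and
# extracting per-residue embeddings. "path/to/ism-weights" is a placeholder,
# not the actual released checkpoint name.
import torch
from transformers import AutoTokenizer, EsmModel

# The tokenizer is the standard ESM2 tokenizer, since ISM is sequence-only input.
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = EsmModel.from_pretrained("path/to/ism-weights")  # hypothetical checkpoint
model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # example protein sequence
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Structure-enriched per-residue representations: (batch, length, hidden_dim)
residue_embeddings = outputs.last_hidden_state
print(residue_embeddings.shape)
```

Because the weights follow the ESM2 format, any downstream pipeline already written against ESM2 embeddings should be able to swap in the ISM checkpoint without code changes.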

Cite

Text

Ouyang-Zhang et al. "Distilling Structural Representations into Protein Sequence Models." International Conference on Learning Representations, 2025.

Markdown

[Ouyang-Zhang et al. "Distilling Structural Representations into Protein Sequence Models." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/ouyangzhang2025iclr-distilling/)

BibTeX

@inproceedings{ouyangzhang2025iclr-distilling,
  title     = {{Distilling Structural Representations into Protein Sequence Models}},
  author    = {Ouyang-Zhang, Jeffrey and Gong, Chengyue and Zhao, Yue and Kraehenbuehl, Philipp and Klivans, Adam and Diaz, Daniel Jesus},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/ouyangzhang2025iclr-distilling/}
}