Distilling Structural Representations into Protein Sequence Models
Abstract
Protein language models, such as the popular ESM2, are widely used tools for extracting evolution-based protein representations and have achieved significant success on downstream biological tasks. Representations from sequence models and structure models, however, show significant performance differences depending on the downstream task. A major open problem is to obtain representations that best capture both the evolutionary and structural properties of proteins in general. Here we introduce the Implicit Structure Model (ISM), a sequence-only input model with structurally enriched representations that outperforms state-of-the-art sequence models on several well-studied benchmarks, including mutation stability assessment and structure prediction. Our key innovations are a microenvironment-based autoencoder for generating structure tokens and a self-supervised training objective that distills these tokens into ESM2's pre-trained model. We have made ISM's structure-enriched weights easily available: integrating ISM into any application using ESM2 requires changing only a single line of code. Our code is available at \url{https://github.com/jozhang97/ISM}.
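To illustrate the abstract's claim that ISM is a drop-in replacement for ESM2, the sketch below swaps the ESM2 checkpoint for an ISM checkpoint when extracting per-residue embeddings. This is a minimal sketch, assuming ISM weights are released in an ESM2-compatible Hugging Face format; the ISM checkpoint identifier shown is illustrative, so consult the repository (https://github.com/jozhang97/ISM) for the actual released names.

```python
# Minimal sketch: extracting per-residue embeddings, swapping ESM2 for ISM.
# The ISM checkpoint ID below is an assumption for illustration; see the
# ISM repository (https://github.com/jozhang97/ISM) for released weights.
import torch
from transformers import AutoTokenizer, EsmModel

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")

# Original ESM2 backbone:
# model = EsmModel.from_pretrained("facebook/esm2_t33_650M_UR50D")
# One-line change to use ISM's structure-enriched weights (ID is hypothetical):
model = EsmModel.from_pretrained("jozhang97/ism_t33_650M_uc30pdb")

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Per-residue representations, shape (batch, sequence_length, hidden_dim).
residue_embeddings = outputs.last_hidden_state
```

Because only the checkpoint path changes, any downstream head trained on ESM2 embeddings can consume ISM embeddings without further code changes, which is the integration pattern the abstract describes.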
Cite
Text
Ouyang-Zhang et al. "Distilling Structural Representations into Protein Sequence Models." NeurIPS 2024 Workshops: AIDrugX, 2024.
Markdown
[Ouyang-Zhang et al. "Distilling Structural Representations into Protein Sequence Models." NeurIPS 2024 Workshops: AIDrugX, 2024.](https://mlanthology.org/neuripsw/2024/ouyangzhang2024neuripsw-distilling/)
BibTeX
@inproceedings{ouyangzhang2024neuripsw-distilling,
  title     = {{Distilling Structural Representations into Protein Sequence Models}},
  author    = {Ouyang-Zhang, Jeffrey and Gong, Chengyue and Zhao, Yue and Kraehenbuehl, Philipp and Klivans, Adam and Diaz, Daniel Jesus},
  booktitle = {NeurIPS 2024 Workshops: AIDrugX},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/ouyangzhang2024neuripsw-distilling/}
}