Structure-Aware Language Models Trained on Ultra-Mega-Scale Metagenomic Data Improve Protein Folding Stability Prediction
Abstract
Predicting absolute protein stability remains challenging due to the limited availability of experimental datasets and the intricate interplay between sequence and structure contributions to protein stability. In this study, we experimentally measured the folding stability of 2 million high-quality, diverse metagenomic MGnify sequences using high-throughput cDNA display methods. This dataset includes 814,000 wild-type (WT) proteins as well as sequences carrying point mutations and insertions/deletions. We fine-tuned the structure-based large language models SaProt and ESM-3 on these stability measurements using LoRA (Low-Rank Adaptation), achieving a Spearman correlation of 0.87 on the MGnify test dataset. Our results demonstrate that these models can predict absolute folding stability for both insertions/deletions and mutational effects, even on non-cDNA-display datasets covering a wide stability range, including large proteins.
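For readers who want to see the fine-tuning recipe in spirit, the sketch below shows LoRA adapter injection and Spearman-correlation evaluation for a stability regressor. It is a minimal, hypothetical illustration: it substitutes a small public ESM-2 checkpoint (`facebook/esm2_t12_35M_UR50D`) for SaProt/ESM-3, and the mean-pooling head, hyperparameters, and toy data are assumptions, not the authors' configuration.

```python
# Hypothetical sketch of LoRA fine-tuning for stability regression.
# Stand-in backbone: a small ESM-2 model from HuggingFace transformers
# (the paper fine-tunes SaProt and ESM-3, whose interfaces differ).
import torch
from torch import nn
from transformers import AutoTokenizer, AutoModel
from peft import LoraConfig, get_peft_model
from scipy.stats import spearmanr

model_name = "facebook/esm2_t12_35M_UR50D"  # illustrative stand-in
tokenizer = AutoTokenizer.from_pretrained(model_name)
backbone = AutoModel.from_pretrained(model_name)
hidden_size = backbone.config.hidden_size

# Inject low-rank adapters into the attention projections; base
# weights are frozen, only the adapters remain trainable.
lora_cfg = LoraConfig(r=8, lora_alpha=16,
                      target_modules=["query", "value"],
                      lora_dropout=0.1)
backbone = get_peft_model(backbone, lora_cfg)

class StabilityRegressor(nn.Module):
    """Mean-pooled encoder states -> scalar folding-stability estimate."""
    def __init__(self, encoder, hidden_size):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
        return self.head(pooled).squeeze(-1)

model = StabilityRegressor(backbone, hidden_size)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# Toy batch: sequences paired with made-up stability targets.
seqs = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MSDNELRQKLAAAGI"]
dg = torch.tensor([2.1, -0.3])
batch = tokenizer(seqs, padding=True, return_tensors="pt")

model.train()
optimizer.zero_grad()
loss = loss_fn(model(batch["input_ids"], batch["attention_mask"]), dg)
loss.backward()
optimizer.step()

# Evaluation metric reported in the abstract: Spearman correlation.
model.eval()
with torch.no_grad():
    pred = model(batch["input_ids"], batch["attention_mask"])
rho, _ = spearmanr(pred.numpy(), dg.numpy())
```

Freezing the backbone and training only the low-rank adapters (plus the small regression head) keeps the trainable parameter count tiny, which is the usual motivation for LoRA on models of this size.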
Cite
Text
Cho et al. "Structure-Aware Language Models Trained on Ultra-Mega-Scale Metagenomic Data Improve Protein Folding Stability Prediction." ICLR 2025 Workshops: GEM, 2025.
Markdown
[Cho et al. "Structure-Aware Language Models Trained on Ultra-Mega-Scale Metagenomic Data Improve Protein Folding Stability Prediction." ICLR 2025 Workshops: GEM, 2025.](https://mlanthology.org/iclrw/2025/cho2025iclrw-structureaware/)
BibTeX
@inproceedings{cho2025iclrw-structureaware,
  title = {{Structure-Aware Language Models Trained on Ultra-Mega-Scale Metagenomic Data Improve Protein Folding Stability Prediction}},
  author = {Cho, Yehlin and Tsuboyama, Kotaro and Rocklin, Gabriel J. and Ovchinnikov, Sergey},
  booktitle = {ICLR 2025 Workshops: GEM},
  year = {2025},
  url = {https://mlanthology.org/iclrw/2025/cho2025iclrw-structureaware/}
}