Sequence-Based Protein Models for the Prediction of Mutations Across Priority Viruses
Abstract
Viruses pose a significant threat to human health. Advances in machine learning for predicting mutation effects have enhanced viral surveillance and enabled the proactive design of vaccines and therapeutics, but the accuracy of these methods across priority viruses remains unclear. We perform the first large-scale modeling across 40 WHO priority pandemic-threat pathogens, many of which are under-surveilled, and find that most have sufficient sequence or structural information for effective modeling, highlighting the potential of these approaches for pandemic preparedness. To understand the limits of current modeling capabilities for viruses, we curate 47 standardized viral deep mutational scanning assays and systematically evaluate the performance of three alignment-based models, three Protein Language Models (PLMs), and two structure-aware PLMs with different training databases. We find marked differences in the performance of these models on viral proteins relative to non-viral proteins. For viral proteins, alignment-based models perform on par with PLMs, though with predictable differences in which model is better for a particular function or virus, depending on the data available. We define confidence metrics for both alignment-based models and PLMs that indicate when additional sequence or structural data may be needed for accurate predictions and that guide model selection when no evaluation data are available. We use these metrics to inform the development of a confidence-weighted hybrid model that builds on the strengths of each approach, adapts to the quality of the data available, and outperforms the best alignment-based model and the best PLM used alone.
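The abstract does not say how the confidence-weighted hybrid combines its inputs, so the following is only a minimal sketch of one plausible scheme, not the authors' method. It assumes each model yields one score per mutation and one non-negative scalar confidence per protein; the names `zscore`, `hybrid_scores`, `alignment_conf`, and `plm_conf` are hypothetical.

```python
from statistics import mean, pstdev


def zscore(scores):
    """Standardize scores so that two models share a comparable scale."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # A constant score vector carries no ranking information.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def hybrid_scores(alignment_scores, plm_scores, alignment_conf, plm_conf):
    """Blend two models' per-mutation scores by their relative confidence.

    Assumption (not from the source): confidences are non-negative scalars
    and the blend is a convex combination of standardized scores.
    """
    if len(alignment_scores) != len(plm_scores):
        raise ValueError("both models must score the same mutations")
    total = alignment_conf + plm_conf
    if total <= 0:
        raise ValueError("at least one confidence must be positive")
    weight = alignment_conf / total  # share given to the alignment model
    a = zscore(alignment_scores)
    p = zscore(plm_scores)
    return [weight * x + (1.0 - weight) * y for x, y in zip(a, p)]


# Hypothetical scores for three mutations of one protein.
blended = hybrid_scores([-2.1, -0.3, -4.0], [-1.0, -0.8, -3.5], 0.7, 0.3)
```

Under these assumptions, a protein with a shallow alignment would receive a low `alignment_conf`, shifting the blend toward the PLM, and the reverse for proteins poorly represented in PLM training data.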
Cite
Text
Gurev et al. "Sequence-Based Protein Models for the Prediction of Mutations Across Priority Viruses." ICLR 2025 Workshops: GEM, 2025.
Markdown
[Gurev et al. "Sequence-Based Protein Models for the Prediction of Mutations Across Priority Viruses." ICLR 2025 Workshops: GEM, 2025.](https://mlanthology.org/iclrw/2025/gurev2025iclrw-sequencebased/)
BibTeX
@inproceedings{gurev2025iclrw-sequencebased,
title = {{Sequence-Based Protein Models for the Prediction of Mutations Across Priority Viruses}},
author = {Gurev, Sarah and Youssef, Noor and Jain, Navami and Marks, Debora Susan},
booktitle = {ICLR 2025 Workshops: GEM},
year = {2025},
url = {https://mlanthology.org/iclrw/2025/gurev2025iclrw-sequencebased/}
}