Evolution-Inspired Loss Functions for Protein Representation Learning

Abstract

AI-based frameworks for protein engineering use self-supervised learning (SSL) to obtain representations for downstream biological predictions. The most common training objective for these methods is wildtype accuracy: given a sequence or structure where a wildtype residue has been masked, predict the missing amino acid. Wildtype accuracy, however, does not align with the primary goal of protein engineering, which is to suggest a mutation rather than to identify what already appears in nature. Here we present Evolutionary Ranking (EvoRank), a training objective that incorporates evolutionary information derived from multiple sequence alignments (MSAs) to learn more diverse protein representations. EvoRank corresponds to ranking amino-acid likelihoods in the probability distribution induced by an MSA. This objective forces models to learn the underlying evolutionary dynamics of a protein. Across a variety of phenotypes and datasets, we demonstrate that EvoRank leads to dramatic improvements in zero-shot performance and can compete with models fine-tuned on experimental data. This is particularly important in protein engineering, where it is expensive to obtain data for fine-tuning.
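
The contrast between the two objectives can be illustrated with a minimal sketch. The abstract does not give the exact form of the EvoRank loss, so the pairwise logistic ranking loss below is an assumption chosen for illustration, as are the function names, the pseudocount, and the toy alignment column. It is not the authors' implementation.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def msa_column_distribution(column, pseudocount=1.0):
    """Empirical amino-acid distribution at one MSA column.

    Gaps and non-standard symbols are ignored. The pseudocount is an
    illustrative assumption, not a value taken from the paper.
    """
    counts = np.full(len(AMINO_ACIDS), pseudocount, dtype=float)
    for residue in column:
        idx = AMINO_ACIDS.find(residue)
        if idx >= 0:
            counts[idx] += 1.0
    return counts / counts.sum()


def wildtype_cross_entropy(logits, wildtype):
    """Masked-residue objective: negative log-probability of the wildtype."""
    log_probs = logits - np.logaddexp.reduce(logits)
    return -log_probs[AMINO_ACIDS.index(wildtype)]


def pairwise_ranking_loss(logits, target_dist):
    """Hypothetical ranking objective (an assumption, not the EvoRank loss).

    For every ordered pair (a, b) where the MSA distribution prefers a over b,
    apply a logistic penalty that shrinks as logits[a] - logits[b] grows.
    """
    target_gap = target_dist[:, None] - target_dist[None, :]
    logit_gap = logits[:, None] - logits[None, :]
    preferred = target_gap > 0
    if not preferred.any():
        return 0.0  # all residues tied: nothing to rank
    return float(np.logaddexp(0.0, -logit_gap[preferred]).mean())


# Toy alignment column: alanine dominates, with some glycine and valine.
column = "AAAAGGV-"
dist = msa_column_distribution(column)

rng = np.random.default_rng(0)
logits = rng.normal(size=len(AMINO_ACIDS))  # stand-in for model outputs

print("wildtype cross-entropy:", wildtype_cross_entropy(logits, "A"))
print("pairwise ranking loss: ", pairwise_ranking_loss(logits, dist))
```

The cross-entropy term only rewards probability mass on the wildtype residue, whereas the ranking term depends on the relative order of all residues observed in the alignment, which is the property the abstract attributes to EvoRank.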

Cite

Text

Gong et al. "Evolution-Inspired Loss Functions for Protein Representation Learning." ICLR 2024 Workshops: GEM, 2024.

Markdown

[Gong et al. "Evolution-Inspired Loss Functions for Protein Representation Learning." ICLR 2024 Workshops: GEM, 2024.](https://mlanthology.org/iclrw/2024/gong2024iclrw-evolutioninspired/)

BibTeX

@inproceedings{gong2024iclrw-evolutioninspired,
  title     = {{Evolution-Inspired Loss Functions for Protein Representation Learning}},
  author    = {Gong, Chengyue and Klivans, Adam and Loy, James Madigan and Chen, Tianlong and Liu, Qiang and Diaz, Daniel Jesus},
  booktitle = {ICLR 2024 Workshops: GEM},
  year      = {2024},
  url       = {https://mlanthology.org/iclrw/2024/gong2024iclrw-evolutioninspired/}
}