Fine-Tuning Protein Language Models by Ranking Protein Fitness

Abstract

Self-supervised protein language models (pLMs) have demonstrated significant potential in predicting the impact of mutations on protein function and fitness, which is crucial for protein design. Several approaches further condition pLMs on natural-language descriptions or multiple sequence alignments (MSAs) to generate proteins of a specific family or function. However, such conditioning is often too coarse-grained to express the target function, and the resulting models still exhibit weak correlation with fitness and struggle to generate fit variants. To address this challenge, we propose a fine-tuning framework that aligns a pLM to a specific fitness landscape by ranking mutants. We show that how the ranked pairs are constructed is crucial when fine-tuning pLMs, and we provide a simple yet effective construction that improves fitness prediction across various datasets. In experiments on ProteinGym, our method yields substantial improvements on fitness prediction tasks even with fewer than 200 labeled examples. Furthermore, we demonstrate that our approach excels at fitness optimization tasks.
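The core idea can be illustrated with a short, hypothetical sketch (not the authors' released code): fine-tune a pLM so that, for a labeled pair of mutants, the sequence assigned the higher log-likelihood is the one with the higher measured fitness. The snippet below assumes a HuggingFace-style pLM whose forward pass returns per-token logits; sequence_score, pairwise_ranking_loss, and the ranked-pair batches are illustrative helpers, and the abstract's key point is that how those ranked pairs are constructed matters most.

import torch
import torch.nn.functional as F

def sequence_score(model, tokens):
    # Proxy fitness: sum of per-residue log-probabilities under the pLM.
    logits = model(tokens).logits                                    # (batch, length, vocab)
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)   # (batch, length)
    return token_logp.sum(dim=-1)                                    # (batch,)

def pairwise_ranking_loss(model, tokens_hi, tokens_lo):
    # Bradley-Terry-style pairwise loss: the mutant with higher measured
    # fitness (tokens_hi) should score higher than the other (tokens_lo).
    s_hi = sequence_score(model, tokens_hi)
    s_lo = sequence_score(model, tokens_lo)
    return -F.logsigmoid(s_hi - s_lo).mean()

# Typical fine-tuning step (optimizer and ranked-pair data loader assumed):
# loss = pairwise_ranking_loss(plm, batch["higher"], batch["lower"])
# loss.backward(); optimizer.step()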

Cite

Text

Lee et al. "Fine-Tuning Protein Language Models by Ranking Protein Fitness." NeurIPS 2023 Workshops: GenBio, 2023.

Markdown

[Lee et al. "Fine-Tuning Protein Language Models by Ranking Protein Fitness." NeurIPS 2023 Workshops: GenBio, 2023.](https://mlanthology.org/neuripsw/2023/lee2023neuripsw-finetuning/)

BibTeX

@inproceedings{lee2023neuripsw-finetuning,
  title     = {{Fine-Tuning Protein Language Models by Ranking Protein Fitness}},
  author    = {Lee, Minji and Lee, Kyungmin and Shin, Jinwoo},
  booktitle = {NeurIPS 2023 Workshops: GenBio},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/lee2023neuripsw-finetuning/}
}