Evaluating Prompt Tuning for Conditional Protein Sequence Generation
Abstract
Text generation models originally developed for natural language processing have proven successful in generating protein sequences. These models are often finetuned for improved performance on more specific tasks, such as generating proteins from families unseen during training. Considering the high computational cost of finetuning separate models for each downstream task, prompt tuning has been proposed as an alternative. However, no openly available implementation of this approach compatible with protein language models has been previously published. Thus, we adapt an open-source codebase designed for NLP models to build a pipeline for prompt tuning on protein sequence data, supporting the protein language models ProtGPT2 and RITA. We evaluate our implementation by learning prompts for conditional sampling of sequences belonging to a specific protein family, which improves performance compared to the base model. However, in the presented use case, we observe discrepancies between text-based evaluation metrics and the predicted biological properties of the generated sequences, identifying open problems in the principled assessment of protein sequence generation quality.
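For illustration, the sketch below shows the general technique the paper evaluates: soft prompt tuning of a causal protein language model, where a small set of trainable prompt embeddings is prepended to every input while the base model stays frozen. This is not the paper's adapted codebase; it uses the Hugging Face PEFT library instead, and the checkpoint name, prompt length, and sampling parameters are assumptions.

# Minimal sketch of soft prompt tuning for a protein language model.
# NOTE: uses Hugging Face PEFT, not the paper's adapted NLP codebase;
# checkpoint, virtual-token count, and sampling settings are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PromptTuningConfig, TaskType, get_peft_model

model_name = "nferruz/ProtGPT2"  # ProtGPT2 checkpoint on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# The soft prompt: trainable embeddings prepended to every input sequence,
# trained on the target protein family while the base weights stay frozen.
config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,  # assumed prompt length
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the prompt embeddings are trainable

# After training, sampling is conditioned on the learned prompt automatically:
inputs = tokenizer("M", return_tensors="pt")  # seed with a start residue
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, top_k=950)
print(tokenizer.decode(outputs[0]))

Training the prompt itself can reuse a standard causal language modeling loop (e.g. the transformers Trainer) over sequences from the target family; only the prompt embeddings receive gradient updates, which is what makes per-task adaptation far cheaper than finetuning a separate model.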
Cite
Text
Nathansen et al. "Evaluating Prompt Tuning for Conditional Protein Sequence Generation." ICLR 2023 Workshops: MLDD, 2023.

Markdown
[Nathansen et al. "Evaluating Prompt Tuning for Conditional Protein Sequence Generation." ICLR 2023 Workshops: MLDD, 2023.](https://mlanthology.org/iclrw/2023/nathansen2023iclrw-evaluating/)

BibTeX
@inproceedings{nathansen2023iclrw-evaluating,
title = {{Evaluating Prompt Tuning for Conditional Protein Sequence Generation}},
author = {Nathansen, Andrea and Klein, Kevin and Renard, Bernhard Y and Nowicka, Melania and Bartoszewicz, Jakub M},
booktitle = {ICLR 2023 Workshops: MLDD},
year = {2023},
url = {https://mlanthology.org/iclrw/2023/nathansen2023iclrw-evaluating/}
}