Protein Language Models Are Biased by Unequal Sequence Sampling Across the Tree of Life
Abstract
Protein language models (pLMs) trained on large protein sequence databases have been used to understand disease and design novel proteins. In design tasks, the likelihood of a protein sequence under a pLM is often used as a proxy for protein fitness, so it is critical to understand what signals likelihoods capture. In this work we find that pLM likelihoods unintentionally encode a species bias: likelihoods of protein sequences from certain species are systematically higher, independent of the protein in question. We quantify this bias and show that it arises in large part because of unequal species representation in popular protein sequence databases. We further show that the bias can be detrimental for some protein design applications, such as enhancing thermostability. These results highlight the importance of understanding and curating pLM training data to mitigate biases and improve protein design capabilities in under-explored parts of sequence space.
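As a rough illustration of what "systematically higher, independent of the protein in question" could mean in practice, the sketch below centers log-likelihoods within each protein and then averages the residuals per species. This is not the authors' procedure; the function name `species_offsets`, the input layout, and the toy numbers are all assumptions made for illustration, and the log-likelihoods are assumed to have been computed beforehand by some pLM.

```python
from collections import defaultdict
from statistics import mean


def species_offsets(scores):
    """Estimate a per-species likelihood offset (hypothetical helper).

    scores: iterable of (protein, species, log_likelihood) tuples, where the
    log-likelihoods are assumed to come from a pLM scored elsewhere.
    Returns a dict mapping species to the mean residual after subtracting
    each protein's mean log-likelihood across species.
    """
    # Group the scores by protein so each protein can be centered separately.
    by_protein = defaultdict(list)
    for protein, species, ll in scores:
        by_protein[protein].append((species, ll))

    # Collect, per species, the deviation from the protein-level mean.
    residuals = defaultdict(list)
    for entries in by_protein.values():
        protein_mean = mean(ll for _, ll in entries)
        for species, ll in entries:
            residuals[species].append(ll - protein_mean)

    # A consistently positive mean residual suggests a species-level bias.
    return {species: mean(vals) for species, vals in residuals.items()}


# Toy, made-up numbers purely for illustration (not data from the paper).
toy_scores = [
    ("proteinA", "species1", -1.0),
    ("proteinA", "species2", -2.0),
    ("proteinB", "species1", -3.0),
    ("proteinB", "species2", -4.0),
]
offsets = species_offsets(toy_scores)
print(offsets)
```

In this toy input, one species scores higher than the other for both proteins, so its offset is positive regardless of which protein is considered. With real data, one would also want to account for how many orthologs each species contributes.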
Cite
Text
Ding and Steinhardt. "Protein Language Models Are Biased by Unequal Sequence Sampling Across the Tree of Life." ICLR 2024 Workshops: GEM, 2024.
Markdown
[Ding and Steinhardt. "Protein Language Models Are Biased by Unequal Sequence Sampling Across the Tree of Life." ICLR 2024 Workshops: GEM, 2024.](https://mlanthology.org/iclrw/2024/ding2024iclrw-protein/)
BibTeX
@inproceedings{ding2024iclrw-protein,
title = {{Protein Language Models Are Biased by Unequal Sequence Sampling Across the Tree of Life}},
author = {Ding, Frances and Steinhardt, Jacob},
booktitle = {ICLR 2024 Workshops: GEM},
year = {2024},
url = {https://mlanthology.org/iclrw/2024/ding2024iclrw-protein/}
}