A New Ultra-High-Throughput Assay for Measuring Protein Fitness

Abstract

Machine learning (ML) for protein design frequently requires large datasets of protein fitness measurements generated by high-throughput experiments; however, publicly available protein fitness datasets generated by deep mutational scanning are noisy and include only $10^3$ to $10^5$ data points. In this work, we present DHARMA, a new protein fitness assay that uses molecular recording via base editors and high-throughput sequencing to measure the fitness of up to $10^6$ variants. To mitigate noise in DHARMA experiments, we design FLIGHTED, a Bayesian inference method that denoises the output of a DHARMA experiment for downstream ML applications. Our results show that DHARMA and FLIGHTED can accurately measure protein fitness with calibrated errors. Using this technology, we generate a new fitness dataset of $160{,}000$ TEV protease variants and benchmark a number of standard ML models, including those using protein language model embeddings, on this dataset. We find that data size is the single most important factor in determining ML model performance and that scaling up protein language models does not currently improve performance. DHARMA and FLIGHTED can help generate additional large protein fitness datasets for the ML community.

Cite

Text

Sundar et al. "A New Ultra-High-Throughput Assay for Measuring Protein Fitness." ICLR 2024 Workshops: GEM, 2024.

Markdown

[Sundar et al. "A New Ultra-High-Throughput Assay for Measuring Protein Fitness." ICLR 2024 Workshops: GEM, 2024.](https://mlanthology.org/iclrw/2024/sundar2024iclrw-new/)

BibTeX

@inproceedings{sundar2024iclrw-new,
  title     = {{A New Ultra-High-Throughput Assay for Measuring Protein Fitness}},
  author    = {Sundar, Vikram and Tu, Boqiang and Guan, Lindsey and Esvelt, Kevin M.},
  booktitle = {ICLR 2024 Workshops: GEM},
  year      = {2024},
  url       = {https://mlanthology.org/iclrw/2024/sundar2024iclrw-new/}
}