Exploring Sequence Landscape of Biosynthetic Gene Clusters with Protein Language Models
Abstract
Many organisms, such as bacteria, fungi, and plants, produce intricate chemicals that are not needed for their growth and reproduction, and thus are called secondary metabolites or natural products (NPs). NPs are a rich source of drugs, with most antibiotics being derivatives of NPs. In a producer organism, NPs are synthesized by a set of enzymes encoded by genes that often lie near each other on the chromosome; such a group of genes is called a biosynthetic gene cluster (BGC). In this work, we explore the capability of protein language models (PLMs) to produce meaningful representations of BGCs. We employ transfer learning to train models to predict the chemical class of the produced compound and explore the topological properties of the resulting embeddings. The code is available in the project's GitHub repository: https://github.com/kalininalab/NaturalPPLuM.
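The transfer-learning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes that each protein in a BGC has already been embedded by a frozen PLM, that a BGC is represented by mean-pooling its protein vectors, and that a nearest-centroid rule stands in for the trained classifier. The function names, the two-dimensional toy vectors, and the class labels are all hypothetical.

```python
from statistics import fmean


def pool_bgc(protein_embeddings):
    """Mean-pool per-protein vectors into one fixed-size BGC vector (assumed aggregation)."""
    return [fmean(dim) for dim in zip(*protein_embeddings)]


def fit_centroids(bgc_vectors, labels):
    """Average the BGC vectors of each chemical class into a class centroid."""
    grouped = {}
    for vec, label in zip(bgc_vectors, labels):
        grouped.setdefault(label, []).append(vec)
    return {label: pool_bgc(vecs) for label, vecs in grouped.items()}


def predict(centroids, bgc_vector):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], bgc_vector))


# Toy stand-ins for frozen PLM outputs: one list of protein vectors per BGC.
train_bgcs = [
    [[1.0, 0.0], [0.8, 0.2]],
    [[0.9, 0.1]],
    [[0.0, 1.0], [0.2, 0.8], [0.1, 0.9]],
]
train_labels = ["polyketide", "polyketide", "terpene"]  # hypothetical labels

pooled = [pool_bgc(bgc) for bgc in train_bgcs]
centroids = fit_centroids(pooled, train_labels)

query = pool_bgc([[0.1, 0.9], [0.3, 0.7]])
predicted = predict(centroids, query)
print(predicted)
```

Mean pooling gives every BGC a vector of the same size regardless of how many genes it contains, which is what lets a simple downstream classifier be trained on top of the frozen embeddings.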
Cite
Text
Malygina and Kalinina. "Exploring Sequence Landscape of Biosynthetic Gene Clusters with Protein Language Models." ICML 2024 Workshops: ML4LMS, 2024.
Markdown
[Malygina and Kalinina. "Exploring Sequence Landscape of Biosynthetic Gene Clusters with Protein Language Models." ICML 2024 Workshops: ML4LMS, 2024.](https://mlanthology.org/icmlw/2024/malygina2024icmlw-exploring/)
BibTeX
@inproceedings{malygina2024icmlw-exploring,
title = {{Exploring Sequence Landscape of Biosynthetic Gene Clusters with Protein Language Models}},
author = {Malygina, Tatiana and Kalinina, Olga},
booktitle = {ICML 2024 Workshops: ML4LMS},
year = {2024},
url = {https://mlanthology.org/icmlw/2024/malygina2024icmlw-exploring/}
}