Augmenting Evolutionary Models with Structure-Based Retrieval
Abstract
Multiple Sequence Alignments (MSAs) are crucial in protein sequence analysis for identifying homologous proteins that share a common evolutionary origin. However, traditional MSA search tools struggle to recover distantly related sequences in the so-called ‘midnight zone’ of protein similarity, where proteins exhibit high structural and functional resemblance despite low sequence similarity. To overcome this limitation, we propose integrating structure similarity search tools to enhance the identification of homologous proteins. Our approach uses Foldseek to search the AlphaFold database and aligns the structurally similar proteins it retrieves to construct Multiple Structure Alignments (MStructAs) alongside traditional MSAs. By combining these alignments, we develop family-specific generative models for protein fitness prediction and evaluate them on diverse assays from the ProteinGym benchmarks. Our findings reveal that incorporating structure-based retrieval into MSAs significantly improves the performance of alignment-based methods, suggesting a robust hybrid retrieval strategy that harnesses both sequence and structure similarities.
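The abstract does not say how the two alignments are combined, so the following is only a minimal sketch of one plausible reading of the hybrid retrieval step. It assumes that both the MSA and the MStructA have already been projected onto the query's coordinates (every row has the same length) and are held as plain dictionaries from sequence ID to aligned row. The function name `merge_alignments` and the toy sequences are hypothetical and do not come from the paper.

```python
from typing import Dict


def merge_alignments(msa: Dict[str, str], mstructa: Dict[str, str]) -> Dict[str, str]:
    """Union two query-anchored alignments (hypothetical helper, not the paper's code).

    Rows from the sequence-based MSA are kept as-is. Rows from the
    structure-based alignment are appended only if neither their ID nor
    their aligned row is already present.
    """
    # Assumption: both alignments share the query's column coordinates.
    lengths = {len(row) for row in list(msa.values()) + list(mstructa.values())}
    if len(lengths) > 1:
        raise ValueError("all rows must share the query's alignment length")

    merged = dict(msa)                 # preserves the MSA's row order
    seen_rows = set(merged.values())   # used to skip duplicate rows
    for seq_id, row in mstructa.items():
        if seq_id in merged or row in seen_rows:
            continue
        merged[seq_id] = row
        seen_rows.add(row)
    return merged


# Toy example with made-up sequences.
msa = {"query": "MKTAYIA", "seqhit1": "MKTAYLA"}
mstructa = {"query": "MKTAYIA", "structhit1": "M-SAWIA", "structhit2": "MKTAYLA"}
merged = merge_alignments(msa, mstructa)
print(list(merged))
```

In this toy case only `structhit1` is added from the structure-based set: `query` is already present by ID, and `structhit2` repeats a row that the MSA already contains. A real pipeline would also need a policy for weighting or filtering the added rows, which this sketch leaves out.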
Cite
Text
Huang et al. "Augmenting Evolutionary Models with Structure-Based Retrieval." ICML 2024 Workshops: ML4LMS, 2024.
Markdown
[Huang et al. "Augmenting Evolutionary Models with Structure-Based Retrieval." ICML 2024 Workshops: ML4LMS, 2024.](https://mlanthology.org/icmlw/2024/huang2024icmlw-augmenting/)
BibTeX
@inproceedings{huang2024icmlw-augmenting,
title = {{Augmenting Evolutionary Models with Structure-Based Retrieval}},
author = {Huang, Yining and Zhang, Zuobai and Tang, Jian and Marks, Debora Susan and Notin, Pascal},
booktitle = {ICML 2024 Workshops: ML4LMS},
year = {2024},
url = {https://mlanthology.org/icmlw/2024/huang2024icmlw-augmenting/}
}