PAIR: Boosting the Predictive Power of Protein Representations with a Corpus of Text Annotations

Abstract

Protein language models trained on raw amino acid sequences have demonstrated impressive success in various protein function prediction tasks. One explanation for this success is that language modeling for amino acid sequences captures the local evolutionary fitness landscape and, therefore, encourages the models to extract rich information about the structure and function of a protein. Yet detecting distant evolutionary relationships from sequences alone remains a challenge. In this work, we conduct a comprehensive study examining the effects of training protein models on nineteen types of expertly curated function annotations in Swiss-Prot. We find that different annotation types have varying effects on the quality of the learned representations, with some even degrading the model's performance. However, by incorporating a carefully selected subset of annotation types, we are able to improve the model's function prediction performance. Notably, unlike existing protein models, our approach either matches or outperforms the widely used bioinformatics tool BLAST in annotating previously uncharacterized proteins.

Cite

Text

Duan et al. "PAIR: Boosting the Predictive Power of Protein Representations with a Corpus of Text Annotations." ICML 2024 Workshops: AI4Science, 2024.

Markdown

[Duan et al. "PAIR: Boosting the Predictive Power of Protein Representations with a Corpus of Text Annotations." ICML 2024 Workshops: AI4Science, 2024.](https://mlanthology.org/icmlw/2024/duan2024icmlw-pair/)

BibTeX

@inproceedings{duan2024icmlw-pair,
  title     = {{PAIR: Boosting the Predictive Power of Protein Representations with a Corpus of Text Annotations}},
  author    = {Duan, Haonan and Skreta, Marta and Cotta, Leonardo and Rajaonson, Ella Miray and Dhawan, Nikita and Aspuru-Guzik, Alan and Maddison, Chris J.},
  booktitle = {ICML 2024 Workshops: AI4Science},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/duan2024icmlw-pair/}
}