Multiple Sequence Alignment as a Sequence-to-Sequence Learning Problem

Edo Dotan, Yonatan Belinkov, Oren Avram, Elya Wygoda, Noa Ecker, Michael Alburquerque, Omri Keren, Gil Loewenthal, Tal Pupko

ICLR 2023

Abstract

The sequence alignment problem is one of the most fundamental problems in bioinformatics, and a plethora of methods have been devised to tackle it. Here we introduce BetaAlign, a methodology for aligning sequences using a natural language processing (NLP) approach. BetaAlign accounts for the possible variability of the evolutionary process among different datasets by using an ensemble of transformers, each trained on millions of samples generated from a different evolutionary model. Our approach leads to alignment accuracy that is similar to, and often better than, that of commonly used methods such as MAFFT, DIALIGN, ClustalW, T-Coffee, PRANK, and MUSCLE.

Cite

Text

Dotan et al. "Multiple Sequence Alignment as a Sequence-to-Sequence Learning Problem." International Conference on Learning Representations, 2023.

Markdown

[Dotan et al. "Multiple Sequence Alignment as a Sequence-to-Sequence Learning Problem." International Conference on Learning Representations, 2023.](https://mlanthology.org/iclr/2023/dotan2023iclr-multiple/)

BibTeX

@inproceedings{dotan2023iclr-multiple,
  title     = {{Multiple Sequence Alignment as a Sequence-to-Sequence Learning Problem}},
  author    = {Dotan, Edo and Belinkov, Yonatan and Avram, Oren and Wygoda, Elya and Ecker, Noa and Alburquerque, Michael and Keren, Omri and Loewenthal, Gil and Pupko, Tal},
  booktitle = {International Conference on Learning Representations},
  year      = {2023},
  url       = {https://mlanthology.org/iclr/2023/dotan2023iclr-multiple/}
}