Discrete Profile Alignment via Constrained Information Bottleneck

Abstract

Amino acid profiles, which capture position-specific mutation probabilities, are a richer encoding of biological sequences than the individual sequences themselves. However, profile comparisons are much more computationally expensive than discrete symbol comparisons, making profiles impractical for many large datasets. Furthermore, because they are such a rich representation, profiles can be difficult to visualize. To overcome these problems, we propose a discretization for profiles using an expanded alphabet representing not just individual amino acids, but common profiles. By using an extension of information bottleneck (IB) incorporating constraints and priors on the class distributions, we find an informationally optimal alphabet. This discretization yields a concise, informative textual representation for profile sequences. Also, alignments between these sequences, while nearly as accurate as the full profile-profile alignments, can be computed almost as quickly as those between individual or consensus sequences. A full pairwise alignment of SwissProt would take years using profiles, but less than 3 days using a discrete IB encoding, illustrating how discrete encoding can expand the range of sequence problems to which profile information can be applied.

Cite

Text

O'Rourke et al. "Discrete Profile Alignment via Constrained Information Bottleneck." Neural Information Processing Systems, 2004.

Markdown

[O'Rourke et al. "Discrete Profile Alignment via Constrained Information Bottleneck." Neural Information Processing Systems, 2004.](https://mlanthology.org/neurips/2004/orourke2004neurips-discrete/)

BibTeX

@inproceedings{orourke2004neurips-discrete,
  title     = {{Discrete Profile Alignment via Constrained Information Bottleneck}},
  author    = {O'Rourke, Sean and Chechik, Gal and Friedman, Robin and Eskin, Eleazar},
  booktitle = {Neural Information Processing Systems},
  year      = {2004},
  pages     = {1009-1016},
  url       = {https://mlanthology.org/neurips/2004/orourke2004neurips-discrete/}
}