Learning Chomsky-like Grammars for Biological Sequence Families

Abstract

This paper presents a new method of measuring performance when positives are rare and investigates whether Chomsky-like grammar representations are useful for learning accurate comprehensible predictors of members of biological sequence families. The positive-only learning framework of the Inductive Logic Programming (ILP) system CProgol is used to generate a grammar for recognising a class of proteins known as human neuropeptide precursors (NPPs). As far as these authors are aware, this is both the first biological grammar learnt using ILP and the first real-world scientific application of the positive-only learning framework of CProgol. Performance is measured using both predictive accuracy and a new cost function, Relative Advantage (RA). The RA results show that searching for NPPs by using our best NPP predictor as a filter is more than 100 times more efficient than randomly selecting proteins for synthesis and testing them for biological activity. The highest RA was achieved by a model which includes grammar-derived features. This RA is significantly higher than the best RA achieved without the use of the grammar-derived features.

Cite

Text

Muggleton et al. "Learning Chomsky-like Grammars for Biological Sequence Families." International Conference on Machine Learning, 2000.

Markdown

[Muggleton et al. "Learning Chomsky-like Grammars for Biological Sequence Families." International Conference on Machine Learning, 2000.](https://mlanthology.org/icml/2000/muggleton2000icml-learning/)

BibTeX

@inproceedings{muggleton2000icml-learning,
  title     = {{Learning Chomsky-like Grammars for Biological Sequence Families}},
  author    = {Muggleton, Stephen H. and Bryant, Christopher H. and Srinivasan, Ashwin},
  booktitle = {International Conference on Machine Learning},
  year      = {2000},
  pages     = {631--638},
  url       = {https://mlanthology.org/icml/2000/muggleton2000icml-learning/}
}