GaKCo: A Fast Gapped K-Mer String Kernel Using Counting

Abstract

String Kernel (SK) techniques, especially those using gapped k -mers as features (gk), have obtained great success in classifying sequences like DNA, protein, and text. However, the state-of-the-art gk-SK runs extremely slow when we increase the dictionary size ( $\varSigma $ ) or allow more mismatches ( M ). This is because current gk-SK uses a trie-based algorithm to calculate co-occurrence of mismatched substrings resulting in a time cost proportional to $O(\varSigma ^{M})$ . We propose a fast algorithm for calculating Ga pped k -mer K ernel using Co unting (GaKCo). GaKCo uses associative arrays to calculate the co-occurrence of substrings using cumulative counting. This algorithm is fast, scalable to larger $\varSigma $ and M , and naturally parallelizable. We provide a rigorous asymptotic analysis that compares GaKCo with the state-of-the-art gk-SK. Theoretically, the time cost of GaKCo is independent of the $\varSigma ^{M}$ term that slows down the trie-based approach. Experimentally, we observe that GaKCo achieves the same accuracy as the state-of-the-art and outperforms its speed by factors of 2, 100, and 4, on classifying sequences of DNA (5 datasets), protein (12 datasets), and character-based English text (2 datasets). (GaKCo is shared as an open source tool at https://github.com/QData/GaKCo-SVM ). Code and data related to this chapter are available at: https://doi.org/10.6084/m9.figshare.5434825 .

Cite

Text

Singh et al. "GaKCo: A Fast Gapped K-Mer String Kernel Using Counting." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2017. doi:10.1007/978-3-319-71249-9_22

Markdown

[Singh et al. "GaKCo: A Fast Gapped K-Mer String Kernel Using Counting." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2017.](https://mlanthology.org/ecmlpkdd/2017/singh2017ecmlpkdd-gakco/) doi:10.1007/978-3-319-71249-9_22

BibTeX

@inproceedings{singh2017ecmlpkdd-gakco,
  title     = {{GaKCo: A Fast Gapped K-Mer String Kernel Using Counting}},
  author    = {Singh, Ritambhara and Sekhon, Arshdeep and Kowsari, Kamran and Lanchantin, Jack and Wang, Beilun and Qi, Yanjun},
  booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
  year      = {2017},
  pages     = {356-373},
  doi       = {10.1007/978-3-319-71249-9_22},
  url       = {https://mlanthology.org/ecmlpkdd/2017/singh2017ecmlpkdd-gakco/}
}