GaKCo: A Fast Gapped K-Mer String Kernel Using Counting
Abstract
String Kernel (SK) techniques, especially those using gapped k -mers as features (gk), have obtained great success in classifying sequences like DNA, protein, and text. However, the state-of-the-art gk-SK runs extremely slow when we increase the dictionary size ( $\varSigma $ ) or allow more mismatches ( M ). This is because current gk-SK uses a trie-based algorithm to calculate co-occurrence of mismatched substrings resulting in a time cost proportional to $O(\varSigma ^{M})$ . We propose a fast algorithm for calculating Ga pped k -mer K ernel using Co unting (GaKCo). GaKCo uses associative arrays to calculate the co-occurrence of substrings using cumulative counting. This algorithm is fast, scalable to larger $\varSigma $ and M , and naturally parallelizable. We provide a rigorous asymptotic analysis that compares GaKCo with the state-of-the-art gk-SK. Theoretically, the time cost of GaKCo is independent of the $\varSigma ^{M}$ term that slows down the trie-based approach. Experimentally, we observe that GaKCo achieves the same accuracy as the state-of-the-art and outperforms its speed by factors of 2, 100, and 4, on classifying sequences of DNA (5 datasets), protein (12 datasets), and character-based English text (2 datasets). (GaKCo is shared as an open source tool at https://github.com/QData/GaKCo-SVM ). Code and data related to this chapter are available at: https://doi.org/10.6084/m9.figshare.5434825 .
Cite
Text
Singh et al. "GaKCo: A Fast Gapped K-Mer String Kernel Using Counting." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2017. doi:10.1007/978-3-319-71249-9_22Markdown
[Singh et al. "GaKCo: A Fast Gapped K-Mer String Kernel Using Counting." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2017.](https://mlanthology.org/ecmlpkdd/2017/singh2017ecmlpkdd-gakco/) doi:10.1007/978-3-319-71249-9_22BibTeX
@inproceedings{singh2017ecmlpkdd-gakco,
title = {{GaKCo: A Fast Gapped K-Mer String Kernel Using Counting}},
author = {Singh, Ritambhara and Sekhon, Arshdeep and Kowsari, Kamran and Lanchantin, Jack and Wang, Beilun and Qi, Yanjun},
booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
year = {2017},
pages = {356-373},
doi = {10.1007/978-3-319-71249-9_22},
url = {https://mlanthology.org/ecmlpkdd/2017/singh2017ecmlpkdd-gakco/}
}