Fast Kernels for Inexact String Matching
Abstract
We introduce several new families of string kernels designed in particular for use with support vector machines (SVMs) for classification of protein sequence data. These kernels – restricted gappy kernels, substitution kernels, and wildcard kernels – are based on feature spaces indexed by k -length subsequences from the string alphabet Σ (or the alphabet augmented by a wildcard character), and hence they are related to the recently presented ( k , m )-mismatch kernel and string kernels used in text classification. However, for all kernels we define here, the kernel value K ( x , y ) can be computed in O ( c _ K (| x | + | y |)) time, where the constant c _ K depends on the parameters of the kernel but is independent of the size |Σ| of the alphabet. Thus the computation of these kernels is linear in the length of the sequences, like the mismatch kernel, but we improve upon the parameter-dependent constant $c_K = k^{m+1} |\Sigma|^m$ of the mismatch kernel. We compute the kernels efficiently using a recursive function based on a trie data structure and relate our new kernels to the recently described transducer formalism. Finally, we report protein classification experiments on a benchmark SCOP dataset, where we show that our new faster kernels achieve SVM classification performance comparable to the mismatch kernel and the Fisher kernel derived from profile hidden Markov models.
Cite
Text
Leslie and Kuang. "Fast Kernels for Inexact String Matching." Annual Conference on Computational Learning Theory, 2003. doi:10.1007/978-3-540-45167-9_10Markdown
[Leslie and Kuang. "Fast Kernels for Inexact String Matching." Annual Conference on Computational Learning Theory, 2003.](https://mlanthology.org/colt/2003/leslie2003colt-fast/) doi:10.1007/978-3-540-45167-9_10BibTeX
@inproceedings{leslie2003colt-fast,
title = {{Fast Kernels for Inexact String Matching}},
author = {Leslie, Christina S. and Kuang, Rui},
booktitle = {Annual Conference on Computational Learning Theory},
year = {2003},
pages = {114-128},
doi = {10.1007/978-3-540-45167-9_10},
url = {https://mlanthology.org/colt/2003/leslie2003colt-fast/}
}