Fast and Space Efficient String Kernels Using Suffix Arrays

Abstract

String kernels which compare the set of all common substrings between two given strings have recently been proposed by Vishwanathan & Smola (2004). Surprisingly, these kernels can be computed in linear time and linear space using annotated suffix trees. Even though, in theory, the suffix tree based algorithm requires O(n) space for an n length string, in practice at least 40n bytes are required - 20n bytes for storing the suffix tree, and an additional 20n bytes for the annotation. This large memory requirement coupled with poor locality of memory access, inherent due to the use of suffix trees, means that the performance of the suffix tree based algorithm deteriorates on large strings. In this paper, we describe a new linear time yet space efficient and scalable algorithm for computing string kernels, based on suffix arrays. Our algorithm is a) faster and easier to implement, b) on the average requires only 19n bytes of storage, and c) exhibits strong locality of memory access. We show that our algorithm can be extended to perform linear time prediction on a test string, and present experiments to validate our claims.

Cite

Text

Teo and Vishwanathan. "Fast and Space Efficient String Kernels Using Suffix Arrays." International Conference on Machine Learning, 2006. doi:10.1145/1143844.1143961

Markdown

[Teo and Vishwanathan. "Fast and Space Efficient String Kernels Using Suffix Arrays." International Conference on Machine Learning, 2006.](https://mlanthology.org/icml/2006/teo2006icml-fast/) doi:10.1145/1143844.1143961

BibTeX

@inproceedings{teo2006icml-fast,
  title     = {{Fast and Space Efficient String Kernels Using Suffix Arrays}},
  author    = {Teo, Choon Hui and Vishwanathan, S. V. N.},
  booktitle = {International Conference on Machine Learning},
  year      = {2006},
  pages     = {929-936},
  doi       = {10.1145/1143844.1143961},
  url       = {https://mlanthology.org/icml/2006/teo2006icml-fast/}
}