Computation of Similarity Measures for Sequential Data Using Generalized Suffix Trees

Abstract

We propose a generic algorithm for computation of similarity measures for sequential data. The algorithm uses generalized suffix trees for efficient calculation of various kernel, distance and non-metric similarity functions. Its worst-case run-time is linear in the length of sequences and independent of the underlying embedding language, which can cover words, k-grams or all contained subsequences. Experiments with network intrusion detection, DNA analysis and text processing applications demonstrate the utility of distances and similarity coefficients for sequences as alternatives to classical kernel functions.
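To make the idea of an embedding language concrete, the following is a minimal sketch of two similarity measures over k-gram counts: a linear kernel and a Manhattan distance. It is not the paper's algorithm. It swaps the generalized suffix tree for plain hash-map counting, which is enough to illustrate what is being computed but does not reproduce the suffix-tree construction or the run-time guarantees stated in the abstract. The function names `kgram_counts`, `linear_kernel` and `manhattan_distance` are hypothetical and chosen only for this illustration.

```python
from collections import Counter


def kgram_counts(seq, k):
    """Count every contiguous substring of length k in seq.

    Illustrative stand-in for the embedding: a hash map of k-gram
    counts, not a generalized suffix tree.
    """
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def linear_kernel(x, y, k):
    """Inner product of the k-gram count vectors of x and y."""
    cx, cy = kgram_counts(x, k), kgram_counts(y, k)
    # Only k-grams present in both sequences contribute to the product.
    return sum(cx[w] * cy[w] for w in cx.keys() & cy.keys())


def manhattan_distance(x, y, k):
    """Sum of absolute count differences over all k-grams of x or y."""
    cx, cy = kgram_counts(x, k), kgram_counts(y, k)
    # A Counter returns 0 for missing keys, so the union is safe to scan.
    return sum(abs(cx[w] - cy[w]) for w in cx.keys() | cy.keys())
```

For example, with `k = 2` the sequences `"abab"` and `"abba"` share the 2-grams `ab` and `ba`, so `linear_kernel("abab", "abba", 2)` evaluates to 3 and `manhattan_distance("abab", "abba", 2)` evaluates to 2. The kernel only needs the k-grams common to both sequences, while the distance has to visit every k-gram that occurs in either one.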

Cite

Text

Rieck et al. "Computation of Similarity Measures for Sequential Data Using Generalized Suffix Trees." Neural Information Processing Systems, 2006.

Markdown

[Rieck et al. "Computation of Similarity Measures for Sequential Data Using Generalized Suffix Trees." Neural Information Processing Systems, 2006.](https://mlanthology.org/neurips/2006/rieck2006neurips-computation/)

BibTeX

@inproceedings{rieck2006neurips-computation,
  title     = {{Computation of Similarity Measures for Sequential Data Using Generalized Suffix Trees}},
  author    = {Rieck, Konrad and Laskov, Pavel and Sonnenburg, Sören},
  booktitle = {Neural Information Processing Systems},
  year      = {2006},
  pages     = {1177--1184},
  url       = {https://mlanthology.org/neurips/2006/rieck2006neurips-computation/}
}