Harry: A Tool for Measuring String Similarity
Abstract
Comparing strings and assessing their similarity is a basic operation in many application domains of machine learning, such as information retrieval, natural language processing and bioinformatics. The practitioner can choose from a large variety of available similarity measures for this task, each emphasizing different aspects of the string data. In this article, we present Harry, a small tool specifically designed for measuring the similarity of strings. Harry implements over 20 similarity measures, including common string distances and string kernels, such as the Levenshtein distance and the Subsequence kernel. The tool has been designed with efficiency in mind and allows for multi-threaded as well as distributed computing, enabling the analysis of large data sets of strings. Harry supports common data formats and thus can interface with analysis environments, such as Matlab, Pylab and Weka.
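To illustrate the kind of measure the abstract refers to, the following is a minimal sketch of the Levenshtein distance in plain Python. It is not Harry's implementation and does not use Harry's interface; the function name `levenshtein` and the two-row dynamic-programming layout are choices made here for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions and substitutions needed to turn `a` into `b`."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`; initially the prefix is empty.
    previous = list(range(len(b) + 1))

    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca by cb (or match)
            ))
        previous = current

    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))  # 3
```

The sketch keeps only two rows of the dynamic-programming table, so memory grows with the shorter string rather than with the product of both lengths. A tool aimed at large data sets would typically add further optimizations that are not shown here.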
Cite
Text
Rieck and Wressnegger. "Harry: A Tool for Measuring String Similarity." Machine Learning Open Source Software, 2016.
Markdown
[Rieck and Wressnegger. "Harry: A Tool for Measuring String Similarity." Machine Learning Open Source Software, 2016.](https://mlanthology.org/mloss/2016/rieck2016jmlr-harry/)
BibTeX
@article{rieck2016jmlr-harry,
title = {{Harry: A Tool for Measuring String Similarity}},
author = {Rieck, Konrad and Wressnegger, Christian},
journal = {Machine Learning Open Source Software},
year = {2016},
pages = {1-5},
volume = {17},
url = {https://mlanthology.org/mloss/2016/rieck2016jmlr-harry/}
}