Difference between Jaro-Winkler and Levenshtein distance?

I need to fuzzy-match millions of records across multiple files. I have identified two algorithms for this: Jaro-Winkler and Levenshtein edit distance.

I don't understand the difference between the two. It seems that Levenshtein gives the number of edits between two strings, while Jaro-Winkler gives a normalized score between 0.0 and 1.0.

My questions are:

  1. What is the fundamental difference between the two algorithms?

  2. How do the two algorithms differ in performance?


Levenshtein counts the number of edits (insertions, deletions, or substitutions) needed to convert one string to the other. Damerau-Levenshtein is a modified version that also considers transpositions as single edits. Although the output is the integer number of edits, this can be normalized to give a similarity value by the formula

1 - (edit distance / length of the larger of the two strings)
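To make this concrete, here is a minimal Python sketch of the classic dynamic-programming edit distance plus the normalization above (function names are my own, not from any particular library):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic DP edit distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def normalized_similarity(a: str, b: str) -> float:
    """1 - (edit distance / length of the longer string), per the formula above."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))
```

For example, `levenshtein("kitten", "sitting")` is 3, giving a normalized similarity of 1 - 3/7 ≈ 0.57.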

The Jaro algorithm is a measure of characters in common, being no more than half the length of the longer string in distance, with consideration for transpositions. Winkler modified this algorithm to support the idea that differences near the start of the string are more significant than differences near the end of the string. Jaro and Jaro-Winkler are suited for comparing smaller strings like words and names.

Deciding which to use is not just a matter of performance. It's important to pick a method that is suited to the nature of the strings you are comparing. In general though, both of the algorithms you mentioned can be expensive, because each string must be compared to every other string, and with millions of strings in your data set, that is a tremendous number of comparisons. That is much more expensive than something like computing a phonetic encoding for each string, and then simply grouping strings sharing identical encodings.
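To illustrate the cheaper approach, here is a simplified Soundex sketch (it omits the full spec's special handling of "h" and "w", treating them like vowels) and how you would bucket names by their encoding instead of comparing every pair:

```python
from collections import defaultdict

# Simplified Soundex consonant codes.
_CODES = {**{c: "1" for c in "bfpv"}, **{c: "2" for c in "cgjkqsxz"},
          **{c: "3" for c in "dt"}, "l": "4",
          **{c: "5" for c in "mn"}, "r": "6"}

def soundex(name: str) -> str:
    """Simplified Soundex: keep the first letter, encode the rest as digits,
    collapse adjacent duplicate codes, pad/truncate to 4 characters."""
    name = name.lower()
    digits = [_CODES.get(c, "") for c in name]
    out, prev = [], digits[0]
    for d in digits[1:]:
        if d and d != prev:
            out.append(d)
        prev = d  # vowels reset prev, so separated duplicates are kept
    return (name[0].upper() + "".join(out) + "000")[:4]

# Group records by encoding: one linear pass instead of all-pairs comparison.
buckets = defaultdict(list)
for name in ["Robert", "Rupert", "Alice"]:
    buckets[soundex(name)].append(name)
```

Here "Robert" and "Rupert" both encode to "R163" and land in the same bucket. In practice phonetic blocking like this is often combined with an edit-distance measure: bucket first, then run the expensive comparison only within each bucket.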

There is a wealth of detailed information on these algorithms and other fuzzy string matching algorithms on the internet. This one will give you a start:

A Comparison of Personal Name Matching: Techniques and Practical Issues

According to that paper, the four Jaro and Levenshtein algorithms I've mentioned rank, from fastest to slowest:

  • Jaro
  • Jaro-Winkler
  • Levenshtein
  • Damerau-Levenshtein

with the slowest taking 2 to 3 times as long as the fastest. Of course these times are dependent on the lengths of the strings and the implementations, and there are ways to optimize these algorithms that may not have been used.