Difference between Jaro-Winkler and Levenshtein distance?

I need to fuzzy-match millions of records across multiple files. I have identified two algorithms for this: Jaro-Winkler and Levenshtein edit distance.

I don't understand the difference between the two. It seems that Levenshtein gives the number of edits between two strings, while Jaro-Winkler gives a normalized score between 0.0 and 1.0.

My questions are:

  1. What is the fundamental difference between the two algorithms?

  2. How do the two algorithms differ in performance?


Levenshtein counts the number of edits (insertions, deletions, or substitutions) needed to convert one string to the other. Damerau-Levenshtein is a modified version that also considers transpositions as single edits. Although the output is the integer number of edits, this can be normalized to give a similarity value by the formula

1 - (edit distance / length of the larger of the two strings)
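To make this concrete, here is a minimal Python sketch of the classic dynamic-programming edit distance plus the normalization above (function names are my own, not from any particular library):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic DP edit distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def normalized_similarity(a: str, b: str) -> float:
    """1 - (edit distance / length of the longer string), per the formula above."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))
```

For example, `levenshtein("kitten", "sitting")` is 3, giving a normalized similarity of 1 - 3/7 ≈ 0.57.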

The Jaro algorithm is a measure of characters in common, being no more than half the length of the longer string in distance, with consideration for transpositions. Winkler modified this algorithm to support the idea that differences near the start of the string are more significant than differences near the end of the string. Jaro and Jaro-Winkler are suited for comparing smaller strings like words and names.

Deciding which to use is not just a matter of performance. It's important to pick a method that is suited to the nature of the strings you are comparing. In general though, both of the algorithms you mentioned can be expensive, because each string must be compared to every other string, and with millions of strings in your data set, that is a tremendous number of comparisons. That is much more expensive than something like computing a phonetic encoding for each string, and then simply grouping strings sharing identical encodings.
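To illustrate the cheaper approach, here is a simplified Soundex sketch (it omits the full spec's special handling of "h" and "w", treating them like vowels) and how you would bucket names by their encoding instead of comparing every pair:

```python
from collections import defaultdict

# Simplified Soundex consonant codes.
_CODES = {**{c: "1" for c in "bfpv"}, **{c: "2" for c in "cgjkqsxz"},
          **{c: "3" for c in "dt"}, "l": "4",
          **{c: "5" for c in "mn"}, "r": "6"}

def soundex(name: str) -> str:
    """Simplified Soundex: keep the first letter, encode the rest as digits,
    collapse adjacent duplicate codes, pad/truncate to 4 characters."""
    name = name.lower()
    digits = [_CODES.get(c, "") for c in name]
    out, prev = [], digits[0]
    for d in digits[1:]:
        if d and d != prev:
            out.append(d)
        prev = d  # vowels reset prev, so separated duplicates are kept
    return (name[0].upper() + "".join(out) + "000")[:4]

# Group records by encoding: one linear pass instead of all-pairs comparison.
buckets = defaultdict(list)
for name in ["Robert", "Rupert", "Alice"]:
    buckets[soundex(name)].append(name)
```

Here "Robert" and "Rupert" both encode to "R163" and land in the same bucket. In practice phonetic blocking like this is often combined with an edit-distance measure: bucket first, then run the expensive comparison only within each bucket.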

There is a wealth of detailed information on these algorithms and other fuzzy string matching algorithms on the internet. This one will give you a start:

A Comparison of Personal Name Matching: Techniques and Practical Issues

According to that paper, the four Jaro and Levenshtein algorithms I've mentioned rank, from fastest to slowest:

  • Jaro
  • Jaro-Winkler
  • Levenshtein
  • Damerau-Levenshtein

with the slowest taking 2 to 3 times as long as the fastest. Of course these times are dependent on the lengths of the strings and the implementations, and there are ways to optimize these algorithms that may not have been used.