What are some algorithms for comparing how similar two strings are?

I need to compare strings to decide whether they represent the same thing. This relates to case titles entered by humans where abbreviations and other small details may differ. For example, consider the following two titles:

std::string first = "Henry C. Harper v. The Law Offices of Huey & Luey, LLP";

As opposed to:

std::string second = "Harper v. The Law Offices of Huey & Luey, LLP";

A human can quickly gauge that these are most likely one and the same. The current approach I have taken is to normalize the strings by lowercasing all letters and removing all punctuation and spaces giving:

std::string firstNormalized = "henrycharpervthelawofficesofhueylueyllp";

And:

std::string secondNormalized = "harpervthelawofficesofhueylueyllp";
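(For reference, a minimal sketch of that normalization step, shown in Python for brevity rather than the C++ I'm actually using:)

```python
import re

def normalize(title):
    # lowercase, then drop everything that is not a letter or digit
    return re.sub(r"[^a-z0-9]", "", title.lower())

normalize("Henry C. Harper v. The Law Offices of Huey & Luey, LLP")
# gives "henrycharpervthelawofficesofhueylueyllp"
```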

Comparing in this case, one is a sub-sequence of the other, but you can imagine other more complex variations where that does not necessarily occur, yet they have significant sub-sequences in common. There could also be occasional human entry errors such as transposed letters and spelling errors.

Perhaps some kind of character diff program could help? I've seen good line-diff programs for comparing code to be checked in; is there something like that on a character basis, maybe in Boost? If you could count the number of consecutive characters in common and take the ratio to the characters unshared, perhaps that would be a good heuristic?

In the end, I need a Boolean decision as to whether to consider them the same or not. It doesn't have to be perfect, but it should ideally rarely be wrong.

What algorithm can I use that will give me some kind of quantification as to how similar the two strings are to each other which I can then convert into a yes/no answer by way of some heuristic?


What you're looking for are called String Metric algorithms. There are a significant number of them, many with similar characteristics. Among the more popular:

  • Levenshtein Distance : The minimum number of single-character edits (insertions, deletions, substitutions) required to change one string into the other. The strings do not have to be the same length.
  • Hamming Distance : The number of positions at which two equal-length strings differ.
  • Smith–Waterman : A local-alignment algorithm for finding the most similar sub-sequences of two strings.
  • Sørensen–Dice Coefficient : A similarity measure based on the proportion of adjacent character pairs (bigrams) that the two strings share.

Have a look at these as well as others on the wiki page on the topic.

Damerau–Levenshtein distance is another algorithm for comparing two strings, similar to the Levenshtein distance algorithm. The difference between the two is that it also counts a transposition of two adjacent characters as a single edit, and hence may give a better result for error correction.

For example: the Levenshtein distance between night and nigth is 2, but the Damerau–Levenshtein distance between night and nigth is 1, because it is just a swap of a pair of adjacent characters.
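A minimal sketch of both distances (function names are my own; the transposition-aware version below is the restricted "optimal string alignment" variant of Damerau–Levenshtein):

```python
def levenshtein(a, b):
    # classic rolling-row dynamic programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def damerau_levenshtein(a, b):
    # optimal string alignment: also allows adjacent transpositions
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,               # deletion
                          d[i][j - 1] + 1,               # insertion
                          d[i - 1][j - 1] + cost)        # substitution
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                # transposition of two adjacent characters
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]

levenshtein("night", "nigth")           # → 2
damerau_levenshtein("night", "nigth")   # → 1
```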

You could use n-grams for that. For example, split the two strings into trigrams (usually lowercased) and compare the percentage of trigrams they have in common.

Your challenge is to define a minimum percentage for similarity.

http://en.wikipedia.org/wiki/N-gram
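A sketch of that idea using character trigrams and a Jaccard-style overlap ratio (the function names here are illustrative, not from any library):

```python
def char_ngrams(s, n=3):
    # set of lowercased character n-grams of s
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def trigram_similarity(a, b, n=3):
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga or not gb:
        return 0.0
    # shared n-grams over total distinct n-grams (Jaccard index)
    return len(ga & gb) / len(ga | gb)
```

You would then pick a cutoff empirically, e.g. treat anything above 0.6 as a match.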

Another algorithm that you can consider is the Simon White Similarity:

def get_bigrams(string):
    """
    Take a string and return a list of bigrams.
    """
    if string is None:
        return []
    s = string.lower()
    return [s[i : i + 2] for i in range(len(s) - 1)]


def simon_similarity(str1, str2):
    """
    Perform bigram comparison between two strings
    and return a percentage match in decimal form.
    """
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    union = len(pairs1) + len(pairs2)

    if union == 0:
        return 0.0

    # Remove each matched bigram from the second list so that
    # repeated bigrams are not counted more than once.
    pairs2 = pairs2[:]
    hit_count = 0
    for x in pairs1:
        if x in pairs2:
            hit_count += 1
            pairs2.remove(x)
    return (2.0 * hit_count) / union

You may use the length of the longest common sub-sequence (LCS) to solve the problem: the closer the LCS length is to the lengths of the two input strings, the more similar they are. In your example, one normalized string is a sub-sequence of the other, so the LCS length equals the length of the shorter string.

You may use dynamic programming to compute it, and if you only need the length rather than the sub-sequence itself, the space complexity can be reduced to linear.
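A sketch of the length-only computation, keeping just two rows of the DP table (O(m·n) time, O(n) space):

```python
def lcs_length(a, b):
    # longest common sub-sequence length via two-row dynamic programming
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur.append(prev[j - 1] + 1)   # extend the common sub-sequence
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]
```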

The string similarity metrics mentioned here can all work. However, further normalization of your text in very specific ways can lead to much better results.

TL;DR

From my experience in speech recognition and handwriting recognition, I think your problem will be best solved by using Text Normalization (archived) followed by a Word Error Rate (archived). A good, quick overview of this process is given at an Amazon AWS Machine Learning Blog post on Evaluating Speech Recognition (archived). A good (somewhat standard) tool for doing both parts of the process (normalization and scoring) is NIST's SCTK. First use rfilter1 for the text normalization, then use sclite to get the score. Figure out based on the score which strings you consider to match.

Further details

I think that there are three areas of study/application in which the problems faced are very similar to yours. They are: 1) Speech Recognition (archived) (a domain "where abbreviations and other small details may differ"); and the related fields of 2) Optical Character Recognition (archived) and 3) Handwriting Recognition (archived) (both domains "where [data is] entered by humans [and] where abbreviations and other small details may differ").

It's especially useful to look at the scoring of automated or human transcribers/recognizers and at the problem of searching for strings in any such transcription. From my experience in these domains, it seems that the best similarity comparisons come from Levenshtein Distance (archived) using words instead of characters for finding the edit distance; this is called the Word Error Rate. A normalization of text, including case, punctuation, and such things as abbreviations, makes the comparison even better.
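If you don't want to pull in SCTK just to experiment, the core idea (word-level Levenshtein edit distance divided by the number of reference words) can be sketched as:

```python
def word_error_rate(reference, hypothesis):
    # Levenshtein distance computed over word tokens instead of characters
    ref = reference.split()
    hyp = hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, rw in enumerate(ref, 1):
        cur = [i]
        for j, hw in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (rw != hw)))    # substitution
        prev = cur
    return prev[-1] / len(ref) if ref else 0.0
```

On the two normalized titles below (14 reference words, 2 deletions) this gives 2/14 ≈ 14.3%, matching the sclite output.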

A Quick Example

It seems you're using C or C++. sclite and rfilter1 are mostly written in C. I hope that an example using bash+sclite will suffice.

Contents of law.glm, a VERY minimal GLM file (GLobal Mapping file, i.e. pairs of search and replace rules)

;;
* name "law.glm"
* desc "Showing extra normalization"
* format = "NIST1"     ;; other option is NIST2
* max_nrules = "1000"  ;; allocating space (I can update this if necessary)
* copy_no_hit = "T"    ;; don't ignore the line if there isn't a match
* case_sensitive = "F"


. => / __ [ ] ;; changes only if there's a space after it
, => / __ [ ]
? => / __ [ ]
! => / __ [ ]
versus => v / [ ] __ [ ] ;; changes only if there's a space before & after
vs => v / [ ] __ [ ]
& => and / [ ] __ [ ]
llp => limited liability partnership / [ ] __ [ ]
llc => limited liability company / [ ] __ [ ]
it's => it is / [ ] __ [ ]
shoppe => shop / [ ] __ [ ]
mister => Mr / [ ] __ [ ]

Now, in bash.

$ first="Henry C. Harper v. The Law Offices of Huey & Luey, LLP (spk1_1)"
$ second="Harper v. The Law Offices of Huey & Luey, LLP (spk1_1)"
#
# sclite requires actual files.
# It also requires something after the string (an ID, which has been
#+ put in as '(spk1_1)', don't worry about the details.)
#
$ echo "${first}" > first.txt
$ echo "${second}" > second.txt
#
# normalization
$ rfilter1 law.glm < first.txt > first_glm_normalized.txt
$ tr '[A-Z]' '[a-z]' < first_glm_normalized.txt > first_normalized.txt
$ rfilter1 law.glm < second.txt > second_glm_normalized.txt
$ tr '[A-Z]' '[a-z]' < second_glm_normalized.txt > second_normalized.txt
#
# Run the scorer (similarity finder)
$ sclite -r first_normalized.txt -h second_normalized.txt -i rm -o snt stdout
===============================================================================


SENTENCE LEVEL REPORT FOR THE SYSTEM:
Name: second_normalized.txt


===============================================================================




SPEAKER spk1
id: (spk1_1)
Scores: (#C #S #D #I) 12 0 2 0
REF:  HENRY C harper v the law offices of huey and luey limited liability partnership
HYP:  ***** * harper v the law offices of huey and luey limited liability partnership
Eval: D     D


Correct               =   85.7%   12   ( 12)
Substitutions         =    0.0%    0   (  0)
Deletions             =   14.3%    2   (  2)
Insertions            =    0.0%    0   (  0)


Errors                =   14.3%    2   (  2)


Ref. words            =           14   ( 14)
Hyp. words            =           12   ( 12)
Aligned words         =           14   ( 14)


-------------------------------------------------------------------------------




$ # That `-i rm' has to do with that ID we talked about

So, there's a 14.3% Word Error Rate.

Now, let's look at a law case name that shouldn't match.

$ third="Larry Viola versus The Law Office of Mister Scrooge McDuck, Limited Liability Corporation (spk1_1)"
$ echo "${third}" > third.txt
$ rfilter1 law.glm < third.txt > third_glm_normalized.txt
$ tr '[A-Z]' '[a-z]' < third_glm_normalized.txt > third_normalized.txt
#
$ sclite -r first_normalized.txt -h third_normalized.txt -i rm -o snt stdout
===============================================================================


SENTENCE LEVEL REPORT FOR THE SYSTEM:
Name: third_normalized.txt


===============================================================================




SPEAKER spk1
id: (spk1_1)
Scores: (#C #S #D #I) 6 7 1 0
REF:  HENRY C     HARPER v the law OFFICES of HUEY AND     LUEY   limited liability PARTNERSHIP
HYP:  ***** LARRY VIOLA  v the law OFFICE  of MR   SCROOGE MCDUCK limited liability CORPORATION
Eval: D     S     S                S          S    S       S                        S


Correct               =   42.9%    6   (  6)
Substitutions         =   50.0%    7   (  7)
Deletions             =    7.1%    1   (  1)
Insertions            =    0.0%    0   (  0)


Errors                =   57.1%    8   (  8)


Ref. words            =           14   ( 14)
Hyp. words            =           13   ( 13)
Aligned words         =           14   ( 14)


-------------------------------------------------------------------------------




$ # This one has a 57.1% Word Error Rate

You'll likely need to run some of your strings through the scoring (comparison) process to come up with a heuristic of where to separate True from False.