Java 中的相似字符串比较

小开

你可以用莱文斯坦距离来计算两个字符串之间的差值。 Http://en.wikipedia.org/wiki/levenshtein_distance

小开

理论上，你可以比较编辑距离。

小开

最佳答案

是的，有很多有据可查的算法，比如:

余弦距离
Jaccard 相似性
骰子系数
相似性匹配
重叠的相似性
等等

一个很好的总结(“ Sam 的字符串度量”) 可以在这里找到(原始链接死亡，所以它链接到互联网档案馆)

还要检查这些项目:

小开

这通常是使用编辑距离度量来完成的。搜索“编辑距离 java”会出现很多库，比如这个。

小开

如果你的字符串变成了文档，听起来像是抄袭检测器抄袭检测器。也许用这个词搜索会有好结果。

“编程集体智能”有一章是关于确定两个文档是否相似的。代码是 Python 编写的，但是很干净，很容易移植。

小开

我把莱文斯坦距离算法翻译成了 JavaScript:

String.prototype.LevenshteinDistance = function (s2) {
var array = new Array(this.length + 1);
for (var i = 0; i < this.length + 1; i++)
array[i] = new Array(s2.length + 1);


for (var i = 0; i < this.length + 1; i++)
array[i][0] = i;
for (var j = 0; j < s2.length + 1; j++)
array[0][j] = j;


for (var i = 1; i < this.length + 1; i++) {
for (var j = 1; j < s2.length + 1; j++) {
if (this[i - 1] == s2[j - 1]) array[i][j] = array[i - 1][j - 1];
else {
array[i][j] = Math.min(array[i][j - 1] + 1, array[i - 1][j] + 1);
array[i][j] = Math.min(array[i][j], array[i - 1][j - 1] + 1);
}
}
}
return array[this.length][s2.length];
};

小开

在许多库中使用的 以0% -100% 的方式计算两个字符串之间的相似度的常用方法是，测量需要将长字符串更改为短字符串的程度(以% 为单位) :

/**
* Calculates the similarity (a number within 0 and 1) between two strings.
*/
public static double similarity(String s1, String s2) {
String longer = s1, shorter = s2;
if (s1.length() < s2.length()) { // longer should always have greater length
longer = s2; shorter = s1;
}
int longerLength = longer.length();
if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
}
// you can use StringUtils.getLevenshteinDistance() as the editDistance() function
// full copy-paste working code is below

计算 `editDistance()`:

上面的 editDistance()函数将计算两个字符串之间的 编辑距离。有几个实现到这一步，每个可能更适合特定的场景。最常见的是 莱文斯坦距离算法，我们将在下面的示例中使用它(对于非常大的字符串，其他算法可能性能更好)。

下面是两个计算编辑距离的选项:

你可以使用 阿帕奇公共文本 的莱文斯坦距离: apply(CharSequence left, CharSequence rightt)
在您自己的系统中实现它。

实例:

点击这里查看在线演示。

public class StringSimilarity {


/**
* Calculates the similarity (a number within 0 and 1) between two strings.
*/
public static double similarity(String s1, String s2) {
String longer = s1, shorter = s2;
if (s1.length() < s2.length()) { // longer should always have greater length
longer = s2; shorter = s1;
}
int longerLength = longer.length();
if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
/* // If you have Apache Commons Text, you can use it to calculate the edit distance:
LevenshteinDistance levenshteinDistance = new LevenshteinDistance();
return (longerLength - levenshteinDistance.apply(longer, shorter)) / (double) longerLength; */
return (longerLength - editDistance(longer, shorter)) / (double) longerLength;


}


// Example implementation of the Levenshtein Edit Distance
// See http://rosettacode.org/wiki/Levenshtein_distance#Java
public static int editDistance(String s1, String s2) {
s1 = s1.toLowerCase();
s2 = s2.toLowerCase();


int[] costs = new int[s2.length() + 1];
for (int i = 0; i <= s1.length(); i++) {
int lastValue = i;
for (int j = 0; j <= s2.length(); j++) {
if (i == 0)
costs[j] = j;
else {
if (j > 0) {
int newValue = costs[j - 1];
if (s1.charAt(i - 1) != s2.charAt(j - 1))
newValue = Math.min(Math.min(newValue, lastValue),
costs[j]) + 1;
costs[j - 1] = lastValue;
lastValue = newValue;
}
}
}
if (i > 0)
costs[s2.length()] = lastValue;
}
return costs[s2.length()];
}


public static void printSimilarity(String s, String t) {
System.out.println(String.format(
"%.3f is the similarity between \"%s\" and \"%s\"", similarity(s, t), s, t));
}


public static void main(String[] args) {
printSimilarity("", "");
printSimilarity("1234567890", "1");
printSimilarity("1234567890", "123");
printSimilarity("1234567890", "1234567");
printSimilarity("1234567890", "1234567890");
printSimilarity("1234567890", "1234567980");
printSimilarity("47/2010", "472010");
printSimilarity("47/2010", "472011");
printSimilarity("47/2010", "AB.CDEF");
printSimilarity("47/2010", "4B.CDEFG");
printSimilarity("47/2010", "AB.CDEFG");
printSimilarity("The quick fox jumped", "The fox jumped");
printSimilarity("The quick fox jumped", "The fox");
printSimilarity("kitten", "sitting");
}


}

产出:

1.000 is the similarity between "" and ""
0.100 is the similarity between "1234567890" and "1"
0.300 is the similarity between "1234567890" and "123"
0.700 is the similarity between "1234567890" and "1234567"
1.000 is the similarity between "1234567890" and "1234567890"
0.800 is the similarity between "1234567890" and "1234567980"
0.857 is the similarity between "47/2010" and "472010"
0.714 is the similarity between "47/2010" and "472011"
0.000 is the similarity between "47/2010" and "AB.CDEF"
0.125 is the similarity between "47/2010" and "4B.CDEFG"
0.000 is the similarity between "47/2010" and "AB.CDEFG"
0.700 is the similarity between "The quick fox jumped" and "The fox jumped"
0.350 is the similarity between "The quick fox jumped" and "The fox"
0.571 is the similarity between "kitten" and "sitting"

小开

感谢第一个答案，我认为有2个计算 computeEditRange (s1，s2)。由于它的时间消耗很大，决定提高代码的性能。所以:

public class LevenshteinDistance {


public static int computeEditDistance(String s1, String s2) {
s1 = s1.toLowerCase();
s2 = s2.toLowerCase();


int[] costs = new int[s2.length() + 1];
for (int i = 0; i <= s1.length(); i++) {
int lastValue = i;
for (int j = 0; j <= s2.length(); j++) {
if (i == 0) {
costs[j] = j;
} else {
if (j > 0) {
int newValue = costs[j - 1];
if (s1.charAt(i - 1) != s2.charAt(j - 1)) {
newValue = Math.min(Math.min(newValue, lastValue),
costs[j]) + 1;
}
costs[j - 1] = lastValue;
lastValue = newValue;
}
}
}
if (i > 0) {
costs[s2.length()] = lastValue;
}
}
return costs[s2.length()];
}


public static void printDistance(String s1, String s2) {
double similarityOfStrings = 0.0;
int editDistance = 0;
if (s1.length() < s2.length()) { // s1 should always be bigger
String swap = s1;
s1 = s2;
s2 = swap;
}
int bigLen = s1.length();
editDistance = computeEditDistance(s1, s2);
if (bigLen == 0) {
similarityOfStrings = 1.0; /* both strings are zero length */
} else {
similarityOfStrings = (bigLen - editDistance) / (double) bigLen;
}
//////////////////////////
//System.out.println(s1 + "-->" + s2 + ": " +
//      editDistance + " (" + similarityOfStrings + ")");
System.out.println(editDistance + " (" + similarityOfStrings + ")");
}


public static void main(String[] args) {
printDistance("", "");
printDistance("1234567890", "1");
printDistance("1234567890", "12");
printDistance("1234567890", "123");
printDistance("1234567890", "1234");
printDistance("1234567890", "12345");
printDistance("1234567890", "123456");
printDistance("1234567890", "1234567");
printDistance("1234567890", "12345678");
printDistance("1234567890", "123456789");
printDistance("1234567890", "1234567890");
printDistance("1234567890", "1234567980");


printDistance("47/2010", "472010");
printDistance("47/2010", "472011");


printDistance("47/2010", "AB.CDEF");
printDistance("47/2010", "4B.CDEFG");
printDistance("47/2010", "AB.CDEFG");


printDistance("The quick fox jumped", "The fox jumped");
printDistance("The quick fox jumped", "The fox");
printDistance("The quick fox jumped",
"The quick fox jumped off the balcany");
printDistance("kitten", "sitting");
printDistance("rosettacode", "raisethysword");
printDistance(new StringBuilder("rosettacode").reverse().toString(),
new StringBuilder("raisethysword").reverse().toString());
for (int i = 1; i < args.length; i += 2) {
printDistance(args[i - 1], args[i]);
}




}
}

小开

确实有很多字符串相似性度量标准:

编辑距离;
达默罗-列文施泰因距离;
Jaro-Winkler 相似性;
最长公共子序列编辑距离;
Q-Gram (Ukkonen) ;
N-克距离(Kondrak) ;
雅卡指数;
索伦森-戴斯系数;
余弦距离
...

你可以在这里找到它们的解释和 Java 实现: Https://github.com/tdebatty/java-string-similarity

小开

您可以使用 Apache commons 文本库来实现这一点:

上述不推荐的说法:

Apache commons java 库 -> 距离 < a href = “ https://comms.apache.org/per/commons-lang/apidocs/org/apache/commons/lang3/StringUtils.html # getFuzzy

小开

您也可以使用 z 算法来查找字符串中的相似性

小开

你可以在没有任何库的情况下使用这个“莱文斯坦距离”算法:

 public static int getLevenshteinDistance(CharSequence s, CharSequence t) {
if (s == null || t == null) {throw new IllegalArgumentException("Strings must not be null");}
int n = s.length();
int m = t.length();


if (n == 0) {
return m;
}
else if (m == 0) {
return n;
}


if (n > m) {
// swap the input strings to consume less memory
final CharSequence tmp = s;
s = t;
t = tmp;
n = m;
m = t.length();
}


final int[] p = new int[n + 1];
// indexes into strings s and t
int i; // iterates through s
int j; // iterates through t
int upper_left;
int upper;


char t_j; // jth character of t
int cost;


for (i = 0; i <= n; i++) {
p[i] = i;
}


for (j = 1; j <= m; j++) {
upper_left = p[0];
t_j = t.charAt(j - 1);
p[0] = j;


for (i = 1; i <= n; i++) {
upper = p[i];
cost = s.charAt(i - 1) == t_j ? 0 : 1;
// minimum of cell to the left+1, to the top+1, diagonally left and up +cost
p[i] = Math.min(Math.min(p[i - 1] + 1, p[i] + 1), upper_left + cost);
upper_left = upper;
}
}


return p[n];
}

从这里

Java 中的相似字符串比较

计算 editDistance():

实例:

计算 `editDistance()`: