Java 中的相似字符串比较

我想比较几个字符串,找出最相似的。我想知道是否有任何库、方法或最佳实践可以返回哪些字符串与其他字符串更相似。例如:

  • “快狐狸跳”-> “狐狸跳”
  • “敏捷的狐狸跳”-> “狐狸”

这种比较会得出结论,前者比后者更为相似。

我想我需要一些方法,比如:

double similarityIndex(String s1, String s2)

有这种东西吗?

编辑: 我为什么要这么做?我正在编写一个脚本,将 MS 项目文件的输出与处理任务的一些遗留系统的输出进行比较。由于遗留系统的字段宽度非常有限,因此添加值时描述会缩短。我想要一些半自动化的方法来查找从 MS 项目的条目是类似的条目在系统上,所以我可以得到生成的关键。它有缺点,因为它仍然需要手动检查,但它将节省大量的工作

142275 次浏览

你可以用莱文斯坦距离来计算两个字符串之间的差值。 Http://en.wikipedia.org/wiki/levenshtein_distance

理论上,你可以比较 编辑距离

是的,有很多有据可查的算法,比如:

  • 余弦距离
  • Jaccard 相似性
  • 骰子系数
  • 相似性匹配
  • 重叠的相似性
  • 等等

一个很好的总结(“ Sam 的字符串度量”) 可以在这里找到(原始链接死亡,所以它链接到互联网档案馆)

还要检查这些项目:

这通常是使用 编辑距离度量来完成的。搜索“编辑距离 java”会出现很多库,比如 这个

如果你的字符串变成了文档,听起来像是 抄袭检测器抄袭检测器。也许用这个词搜索会有好结果。

“编程集体智能”有一章是关于确定两个文档是否相似的。代码是 Python 编写的,但是很干净,很容易移植。

我把 莱文斯坦距离算法翻译成了 JavaScript:

String.prototype.LevenshteinDistance = function (s2) {
var array = new Array(this.length + 1);
for (var i = 0; i < this.length + 1; i++)
array[i] = new Array(s2.length + 1);


for (var i = 0; i < this.length + 1; i++)
array[i][0] = i;
for (var j = 0; j < s2.length + 1; j++)
array[0][j] = j;


for (var i = 1; i < this.length + 1; i++) {
for (var j = 1; j < s2.length + 1; j++) {
if (this[i - 1] == s2[j - 1]) array[i][j] = array[i - 1][j - 1];
else {
array[i][j] = Math.min(array[i][j - 1] + 1, array[i - 1][j] + 1);
array[i][j] = Math.min(array[i][j], array[i - 1][j - 1] + 1);
}
}
}
return array[this.length][s2.length];
};

在许多库中使用的 以0% -100% 的方式计算两个字符串之间的相似度的常用方法是,测量需要将长字符串更改为短字符串的程度(以% 为单位) :

/**
* Calculates the similarity (a number within 0 and 1) between two strings.
*/
public static double similarity(String s1, String s2) {
String longer = s1, shorter = s2;
if (s1.length() < s2.length()) { // longer should always have greater length
longer = s2; shorter = s1;
}
int longerLength = longer.length();
if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
}
// you can use StringUtils.getLevenshteinDistance() as the editDistance() function
// full copy-paste working code is below


计算 editDistance():

上面的 editDistance()函数将计算两个字符串之间的 编辑距离。有 几个实现到这一步,每个可能更适合特定的场景。最常见的是 莱文斯坦距离算法,我们将在下面的示例中使用它(对于非常大的字符串,其他算法可能性能更好)。

下面是两个计算编辑距离的选项:


实例:

点击这里查看在线演示。

public class StringSimilarity {


/**
* Calculates the similarity (a number within 0 and 1) between two strings.
*/
public static double similarity(String s1, String s2) {
String longer = s1, shorter = s2;
if (s1.length() < s2.length()) { // longer should always have greater length
longer = s2; shorter = s1;
}
int longerLength = longer.length();
if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
/* // If you have Apache Commons Text, you can use it to calculate the edit distance:
LevenshteinDistance levenshteinDistance = new LevenshteinDistance();
return (longerLength - levenshteinDistance.apply(longer, shorter)) / (double) longerLength; */
return (longerLength - editDistance(longer, shorter)) / (double) longerLength;


}


// Example implementation of the Levenshtein Edit Distance
// See http://rosettacode.org/wiki/Levenshtein_distance#Java
public static int editDistance(String s1, String s2) {
s1 = s1.toLowerCase();
s2 = s2.toLowerCase();


int[] costs = new int[s2.length() + 1];
for (int i = 0; i <= s1.length(); i++) {
int lastValue = i;
for (int j = 0; j <= s2.length(); j++) {
if (i == 0)
costs[j] = j;
else {
if (j > 0) {
int newValue = costs[j - 1];
if (s1.charAt(i - 1) != s2.charAt(j - 1))
newValue = Math.min(Math.min(newValue, lastValue),
costs[j]) + 1;
costs[j - 1] = lastValue;
lastValue = newValue;
}
}
}
if (i > 0)
costs[s2.length()] = lastValue;
}
return costs[s2.length()];
}


public static void printSimilarity(String s, String t) {
System.out.println(String.format(
"%.3f is the similarity between \"%s\" and \"%s\"", similarity(s, t), s, t));
}


public static void main(String[] args) {
printSimilarity("", "");
printSimilarity("1234567890", "1");
printSimilarity("1234567890", "123");
printSimilarity("1234567890", "1234567");
printSimilarity("1234567890", "1234567890");
printSimilarity("1234567890", "1234567980");
printSimilarity("47/2010", "472010");
printSimilarity("47/2010", "472011");
printSimilarity("47/2010", "AB.CDEF");
printSimilarity("47/2010", "4B.CDEFG");
printSimilarity("47/2010", "AB.CDEFG");
printSimilarity("The quick fox jumped", "The fox jumped");
printSimilarity("The quick fox jumped", "The fox");
printSimilarity("kitten", "sitting");
}


}

产出:

1.000 is the similarity between "" and ""
0.100 is the similarity between "1234567890" and "1"
0.300 is the similarity between "1234567890" and "123"
0.700 is the similarity between "1234567890" and "1234567"
1.000 is the similarity between "1234567890" and "1234567890"
0.800 is the similarity between "1234567890" and "1234567980"
0.857 is the similarity between "47/2010" and "472010"
0.714 is the similarity between "47/2010" and "472011"
0.000 is the similarity between "47/2010" and "AB.CDEF"
0.125 is the similarity between "47/2010" and "4B.CDEFG"
0.000 is the similarity between "47/2010" and "AB.CDEFG"
0.700 is the similarity between "The quick fox jumped" and "The fox jumped"
0.350 is the similarity between "The quick fox jumped" and "The fox"
0.571 is the similarity between "kitten" and "sitting"

感谢第一个答案,我认为有2个计算 computeEditRange (s1,s2)。由于它的时间消耗很大,决定提高代码的性能。所以:

public class LevenshteinDistance {


public static int computeEditDistance(String s1, String s2) {
s1 = s1.toLowerCase();
s2 = s2.toLowerCase();


int[] costs = new int[s2.length() + 1];
for (int i = 0; i <= s1.length(); i++) {
int lastValue = i;
for (int j = 0; j <= s2.length(); j++) {
if (i == 0) {
costs[j] = j;
} else {
if (j > 0) {
int newValue = costs[j - 1];
if (s1.charAt(i - 1) != s2.charAt(j - 1)) {
newValue = Math.min(Math.min(newValue, lastValue),
costs[j]) + 1;
}
costs[j - 1] = lastValue;
lastValue = newValue;
}
}
}
if (i > 0) {
costs[s2.length()] = lastValue;
}
}
return costs[s2.length()];
}


public static void printDistance(String s1, String s2) {
double similarityOfStrings = 0.0;
int editDistance = 0;
if (s1.length() < s2.length()) { // s1 should always be bigger
String swap = s1;
s1 = s2;
s2 = swap;
}
int bigLen = s1.length();
editDistance = computeEditDistance(s1, s2);
if (bigLen == 0) {
similarityOfStrings = 1.0; /* both strings are zero length */
} else {
similarityOfStrings = (bigLen - editDistance) / (double) bigLen;
}
//////////////////////////
//System.out.println(s1 + "-->" + s2 + ": " +
//      editDistance + " (" + similarityOfStrings + ")");
System.out.println(editDistance + " (" + similarityOfStrings + ")");
}


public static void main(String[] args) {
printDistance("", "");
printDistance("1234567890", "1");
printDistance("1234567890", "12");
printDistance("1234567890", "123");
printDistance("1234567890", "1234");
printDistance("1234567890", "12345");
printDistance("1234567890", "123456");
printDistance("1234567890", "1234567");
printDistance("1234567890", "12345678");
printDistance("1234567890", "123456789");
printDistance("1234567890", "1234567890");
printDistance("1234567890", "1234567980");


printDistance("47/2010", "472010");
printDistance("47/2010", "472011");


printDistance("47/2010", "AB.CDEF");
printDistance("47/2010", "4B.CDEFG");
printDistance("47/2010", "AB.CDEFG");


printDistance("The quick fox jumped", "The fox jumped");
printDistance("The quick fox jumped", "The fox");
printDistance("The quick fox jumped",
"The quick fox jumped off the balcany");
printDistance("kitten", "sitting");
printDistance("rosettacode", "raisethysword");
printDistance(new StringBuilder("rosettacode").reverse().toString(),
new StringBuilder("raisethysword").reverse().toString());
for (int i = 1; i < args.length; i += 2) {
printDistance(args[i - 1], args[i]);
}




}
}

确实有很多字符串相似性度量标准:

  • 编辑距离;
  • 达默罗-列文施泰因距离;
  • Jaro-Winkler 相似性;
  • 最长公共子序列编辑距离;
  • Q-Gram (Ukkonen) ;
  • N-克距离(Kondrak) ;
  • 雅卡指数;
  • 索伦森-戴斯系数;
  • 余弦距离
  • ...

你可以在这里找到它们的解释和 Java 实现: Https://github.com/tdebatty/java-string-similarity

您可以使用 Apache commons 文本库来实现这一点:


上述不推荐的说法:

Apache commons java 库 -> 距离 < a href = “ https://comms.apache.org/per/commons-lang/apidocs/org/apache/commons/lang3/StringUtils.html # getFuzzy

您也可以使用 z 算法来查找字符串中的相似性

你可以在没有任何库的情况下使用这个“莱文斯坦距离”算法:

 public static int getLevenshteinDistance(CharSequence s, CharSequence t) {
if (s == null || t == null) {throw new IllegalArgumentException("Strings must not be null");}
int n = s.length();
int m = t.length();


if (n == 0) {
return m;
}
else if (m == 0) {
return n;
}


if (n > m) {
// swap the input strings to consume less memory
final CharSequence tmp = s;
s = t;
t = tmp;
n = m;
m = t.length();
}


final int[] p = new int[n + 1];
// indexes into strings s and t
int i; // iterates through s
int j; // iterates through t
int upper_left;
int upper;


char t_j; // jth character of t
int cost;


for (i = 0; i <= n; i++) {
p[i] = i;
}


for (j = 1; j <= m; j++) {
upper_left = p[0];
t_j = t.charAt(j - 1);
p[0] = j;


for (i = 1; i <= n; i++) {
upper = p[i];
cost = s.charAt(i - 1) == t_j ? 0 : 1;
// minimum of cell to the left+1, to the top+1, diagonally left and up +cost
p[i] = Math.min(Math.min(p[i - 1] + 1, p[i] + 1), upper_left + cost);
upper_left = upper;
}
}


return p[n];
}

从这里