查找两个字符串之间的公共子字符串

我想比较2个字符串,并保持匹配,在比较失败的地方分开。

如果我有两个字符串:

string1 = "apples"
string2 = "appleses"


answer = "apples"

另一个例子,因为字符串可以有多个单词:

string1 = "apple pie available"
string2 = "apple pies"


answer = "apple pie"

我相信有一个简单的 Python 方法可以做到这一点,但我不能工作出来,任何帮助和解释感谢。

157644 次浏览
def common_start(sa, sb):
""" returns the longest common substring from the beginning of sa and sb """
def _iter():
for a, b in zip(sa, sb):
if a == b:
yield a
else:
return


return ''.join(_iter())
>>> common_start("apple pie available", "apple pies")
'apple pie'

Or a slightly stranger way:

def stop_iter():
"""An easy way to break out of a generator"""
raise StopIteration


def common_start(sa, sb):
return ''.join(a if a == b else stop_iter() for a, b in zip(sa, sb))

Which might be more readable as

def terminating(cond):
"""An easy way to break out of a generator"""
if cond:
return True
raise StopIteration


def common_start(sa, sb):
return ''.join(a for a, b in zip(sa, sb) if terminating(a == b))

Try:

import itertools as it
''.join(el[0] for el in it.takewhile(lambda t: t[0] == t[1], zip(string1, string2)))

It does the comparison from the beginning of both strings.

Its called Longest Common Substring problem. Here I present a simple, easy to understand but inefficient solution. It will take a long time to produce correct output for large strings, as the complexity of this algorithm is O(N^2).

def longestSubstringFinder(string1, string2):
answer = ""
len1, len2 = len(string1), len(string2)
for i in range(len1):
match = ""
for j in range(len2):
if (i + j < len1 and string1[i + j] == string2[j]):
match += string2[j]
else:
if (len(match) > len(answer)): answer = match
match = ""
return answer


print(longestSubstringFinder("apple pie available", "apple pies"))
print(longestSubstringFinder("apples", "appleses"))
print(longestSubstringFinder("bapples", "cappleses"))

Output

apple pie
apples
apples

Returns the first longest common substring:

def compareTwoStrings(string1, string2):
list1 = list(string1)
list2 = list(string2)


match = []
output = ""
length = 0


for i in range(0, len(list1)):


if list1[i] in list2:
match.append(list1[i])


for j in range(i + 1, len(list1)):


if ''.join(list1[i:j]) in string2:
match.append(''.join(list1[i:j]))


else:
continue
else:
continue


for string in match:


if length < len(list(string)):
length = len(list(string))
output = string


else:
continue


return output

This isn't the most efficient way to do it but it's what I could come up with and it works. If anyone can improve it, please do. What it does is it makes a matrix and puts 1 where the characters match. Then it scans the matrix to find the longest diagonal of 1s, keeping track of where it starts and ends. Then it returns the substring of the input string with the start and end positions as arguments.

Note: This only finds one longest common substring. If there's more than one, you could make an array to store the results in and return that Also, it's case sensitive so (Apple pie, apple pie) will return pple pie.

def longestSubstringFinder(str1, str2):
answer = ""


if len(str1) == len(str2):
if str1==str2:
return str1
else:
longer=str1
shorter=str2
elif (len(str1) == 0 or len(str2) == 0):
return ""
elif len(str1)>len(str2):
longer=str1
shorter=str2
else:
longer=str2
shorter=str1


matrix = numpy.zeros((len(shorter), len(longer)))


for i in range(len(shorter)):
for j in range(len(longer)):
if shorter[i]== longer[j]:
matrix[i][j]=1


longest=0


start=[-1,-1]
end=[-1,-1]
for i in range(len(shorter)-1, -1, -1):
for j in range(len(longer)):
count=0
begin = [i,j]
while matrix[i][j]==1:


finish=[i,j]
count=count+1
if j==len(longer)-1 or i==len(shorter)-1:
break
else:
j=j+1
i=i+1


i = i-count
if count>longest:
longest=count
start=begin
end=finish
break


answer=shorter[int(start[0]): int(end[0])+1]
return answer

The same as Evo's, but with arbitrary number of strings to compare:

def common_start(*strings):
""" Returns the longest common substring
from the beginning of the `strings`
"""
def _iter():
for z in zip(*strings):
if z.count(z[0]) == len(z):  # check all elements in `z` are the same
yield z[0]
else:
return


return ''.join(_iter())

First a helper function adapted from the itertools pairwise recipe to produce substrings.

import itertools
def n_wise(iterable, n = 2):
'''n = 2 -> (s0,s1), (s1,s2), (s2, s3), ...


n = 3 -> (s0,s1, s2), (s1,s2, s3), (s2, s3, s4), ...'''
a = itertools.tee(iterable, n)
for x, thing in enumerate(a[1:]):
for _ in range(x+1):
next(thing, None)
return zip(*a)

Then a function the iterates over substrings, longest first, and tests for membership. (efficiency not considered)

def foo(s1, s2):
'''Finds the longest matching substring
'''
# the longest matching substring can only be as long as the shortest string
#which string is shortest?
shortest, longest = sorted([s1, s2], key = len)
#iterate over substrings, longest substrings first
for n in range(len(shortest)+1, 2, -1):
for sub in n_wise(shortest, n):
sub = ''.join(sub)
if sub in longest:
#return the first one found, it should be the longest
return sub


s = "fdomainster"
t = "exdomainid"
print(foo(s,t))

>>>
domain
>>>

For completeness, difflib in the standard-library provides loads of sequence-comparison utilities. For instance find_longest_match which finds the longest common substring when used on strings. Example use:

from difflib import SequenceMatcher


string1 = "apple pie available"
string2 = "come have some apple pies"


match = SequenceMatcher(None, string1, string2).find_longest_match()


print(match)  # -> Match(a=0, b=15, size=9)
print(string1[match.a:match.a + match.size])  # -> apple pie
print(string2[match.b:match.b + match.size])  # -> apple pie

If you're using a version older than 3.9, you'need to call find_longest_match() with the following arguments:

SequenceMatcher(None, string1, string2).find_longest_match(0, len(string1), 0, len(string2))

Fix bugs with the first's answer:

def longestSubstringFinder(string1, string2):
answer = ""
len1, len2 = len(string1), len(string2)
for i in range(len1):
for j in range(len2):
lcs_temp = 0
match = ''
while ((i+lcs_temp < len1) and (j+lcs_temp<len2) and string1[i+lcs_temp] == string2[j+lcs_temp]):
match += string2[j+lcs_temp]
lcs_temp += 1
if len(match) > len(answer):
answer = match
return answer


print(longestSubstringFinder("dd apple pie available", "apple pies"))
print(longestSubstringFinder("cov_basic_as_cov_x_gt_y_rna_genes_w1000000", "cov_rna15pcs_as_cov_x_gt_y_rna_genes_w1000000")
print(longestSubstringFinder("bapples", "cappleses"))
print(longestSubstringFinder("apples", "apples"))
def matchingString(x,y):
match=''
for i in range(0,len(x)):
for j in range(0,len(y)):
k=1
# now applying while condition untill we find a substring match and length of substring is less than length of x and y
while (i+k <= len(x) and j+k <= len(y) and x[i:i+k]==y[j:j+k]):
if len(match) <= len(x[i:i+k]):
match = x[i:i+k]
k=k+1
return match


print matchingString('apple','ale') #le
print matchingString('apple pie available','apple pies') #apple pie
def LongestSubString(s1,s2):
left = 0
right =len(s2)
while(left<right):
if(s2[left] not in s1):
left = left+1
else:
if(s2[left:right] not in s1):
right = right - 1
else:
return(s2[left:right])


s1 = "pineapple"
s2 = "applc"
print(LongestSubString(s1,s2))

One might also consider os.path.commonprefix that works on characters and thus can be used for any strings.

import os
common = os.path.commonprefix(['apple pie available', 'apple pies'])
assert common == 'apple pie'

As the function name indicates, this only considers the common prefix of two strings.

This is the classroom problem called 'Longest sequence finder'. I have given some simple code that worked for me, also my inputs are lists of a sequence which can also be a string:

def longest_substring(list1,list2):
both=[]
if len(list1)>len(list2):
small=list2
big=list1
else:
small=list1
big=list2
removes=0
stop=0
for i in small:
for j in big:
if i!=j:
removes+=1
if stop==1:
break
elif i==j:
both.append(i)
for q in range(removes+1):
big.pop(0)
stop=1
break
removes=0
return both
**Return the comman longest substring**
def longestSubString(str1, str2):
longestString = ""
maxLength = 0
for i in range(0, len(str1)):
if str1[i] in str2:
for j in range(i + 1, len(str1)):
if str1[i:j] in str2:
if(len(str1[i:j]) > maxLength):
maxLength = len(str1[i:j])
longestString =  str1[i:j]
return longestString

A Trie data structure would work the best, better than DP. Here is the code.

class TrieNode:
def __init__(self):
self.child = [None]*26
self.endWord = False


class Trie:


def __init__(self):
self.root = self.getNewNode()


def getNewNode(self):
return TrieNode()


def insert(self,value):
root = self.root




for i,character in enumerate(value):
index = ord(character) - ord('a')
if not root.child[index]:
root.child[index] = self.getNewNode()
root = root.child[index]


root.endWord = True




def search(self,value):
root = self.root


for i,character in enumerate(value):
index = ord(character) - ord('a')
if not root.child[index]:
return False
root = root.child[index]
return root.endWord


def main():


# Input keys (use only 'a' through 'z' and lower case)
keys = ["the","anaswe"]
output = ["Not present in trie",
"Present in trie"]


# Trie object
t = Trie()


# Construct trie
for key in keys:
t.insert(key)


# Search for different keys
print("{} ---- {}".format("the",output[t.search("the")]))
print("{} ---- {}".format("these",output[t.search("these")]))
print("{} ---- {}".format("their",output[t.search("their")]))
print("{} ---- {}".format("thaw",output[t.search("thaw")]))


if __name__ == '__main__':
main()

Let me know in case of doubts.

In case we have a list of words that we need to find all common substrings I check some of the codes above and the best was https://stackoverflow.com/a/42882629/8520109 but it has some bugs for example 'histhome' and 'homehist'. In this case, we should have 'hist' and 'home' as a result. Furthermore, it differs if the order of arguments is changed. So I change the code to find every block of substring and it results a set of common substrings:

main = input().split(" ")    #a string of words separated by space
def longestSubstringFinder(string1, string2):
'''Find the longest matching word'''
answer = ""
len1, len2 = len(string1), len(string2)
for i in range(len1):
for j in range(len2):
lcs_temp=0
match=''
while ((i+lcs_temp < len1) and (j+lcs_temp<len2) and string1[i+lcs_temp] == string2[j+lcs_temp]):
match += string2[j+lcs_temp]
lcs_temp+=1
if (len(match) > len(answer)):
answer = match
return answer


def listCheck(main):
'''control the input for finding substring in a list of words'''
string1 = main[0]
result = []
for i in range(1, len(main)):
string2 = main[i]
res1 = longestSubstringFinder(string1, string2)
res2 = longestSubstringFinder(string2, string1)
result.append(res1)
result.append(res2)
result.sort()
return result


first_answer = listCheck(main)


final_answer  = []




for item1 in first_answer:    #to remove some incorrect match
string1 = item1
double_check = True
for item2 in main:
string2 = item2
if longestSubstringFinder(string1, string2) != string1:
double_check = False
if double_check:
final_answer.append(string1)


print(set(final_answer))

main = 'ABACDAQ BACDAQA ACDAQAW XYZCDAQ' #>>> {'CDAQ'}
main = 'homehist histhome' #>>> {'hist', 'home'}


This script requests you the minimum common substring length and gives all common substrings in two strings. Also, it eliminates shorter substrings that longer substrings include already.

def common_substrings(str1,str2):
len1,len2=len(str1),len(str2)


if len1 > len2:
str1,str2=str2,str1
len1,len2=len2,len1
#short string=str1 and long string=str2


min_com = int(input('Please enter the minumum common substring length:'))
    

cs_array=[]
for i in range(len1,min_com-1,-1):
for k in range(len1-i+1):
if (str1[k:i+k] in str2):
flag=1
for m in range(len(cs_array)):
if str1[k:i+k] in cs_array[m]:
#print(str1[k:i+k])
flag=0
break
if flag==1:
cs_array.append(str1[k:i+k])
if len(cs_array):
print(cs_array)
else:
print('There is no any common substring according to the parametres given')


common_substrings('ciguliuana','ciguana')
common_substrings('apples','appleses')
common_substrings('apple pie available','apple pies')

As if this question doesn't have enough answers, here's another option:

from collections import defaultdict
def LongestCommonSubstring(string1, string2):
match = ""
matches = defaultdict(list)
str1, str2 = sorted([string1, string2], key=lambda x: len(x))


for i in range(len(str1)):
for k in range(i, len(str1)):
cur = match + str1[k]
if cur in str2:
match = cur
else:
match = ""
            

if match:
matches[len(match)].append(match)
        

if not matches:
return ""


longest_match = max(matches.keys())
        

return matches[longest_match][0]


Some example cases:

LongestCommonSubstring("whose car?", "this is my car")
> ' car'
LongestCommonSubstring("apple pies", "apple? forget apple pie!")
> 'apple pie'


def LongestSubString(s1,s2):
if len(s1)<len(s2) :
s1,s2 = s2,s1
    

maxsub =''
for i in range(len(s2)):
for j in range(len(s2),i,-1):
if s2[i:j] in s1 and j-i>len(maxsub):
return  s2[i:j]

The fastest way I've found is to use suffix_trees package:

from suffix_trees import STree


a = ["xxxabcxxx", "adsaabc"]
st = STree.STree(a)
print(st.lcs()) # "abc"