Find string between two substrings

How do I find a string between two substrings ('123STRINGabc' -> 'STRING')?

My current method is like this:

>>> start = 'asdf=5;'
>>> end = '123jasd'
>>> s = 'asdf=5;iwantthis123jasd'
>>> print((s.split(start))[1].split(end)[0])
iwantthis

However, this seems very inefficient and un-Pythonic. What is a better way to do something like this?

The string might not start and end with start and end. There may be more characters before and after them.

s[len(start):-len(end)]
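
The slice above only gives the right answer when s really begins with start and ends with end, and it returns an empty string when end is empty. As a minimal sketch of the same idea, assuming Python 3.9 or later, str.removeprefix and str.removesuffix leave the string alone when a marker is not at that edge. The helper name strip_markers is hypothetical:

```python
def strip_markers(s, start, end):
    """Drop `start` from the front and `end` from the back of `s`, if present."""
    # str.removeprefix / str.removesuffix are available from Python 3.9.
    # Each returns the string unchanged when the marker is not at that edge.
    return s.removeprefix(start).removesuffix(end)


print(strip_markers('asdf=5;iwantthis123jasd', 'asdf=5;', '123jasd'))  # iwantthis
```

Like the slice, this does not help when there are extra characters before start or after end.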

My method would be:

find index of start string in s => i
find index of end string in s => j


substring = substring(i+len(start) to j-1)

s = "123123STRINGabcabc"


def find_between(s, first, last):
    try:
        start = s.index(first) + len(first)
        end = s.index(last, start)
        return s[start:end]
    except ValueError:
        return ""


def find_between_r(s, first, last):
    try:
        start = s.rindex(first) + len(first)
        end = s.rindex(last, start)
        return s[start:end]
    except ValueError:
        return ""


print(find_between(s, "123", "abc"))
print(find_between_r(s, "123", "abc"))

gives:

123STRING
STRINGabc

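I think it should be noted that, depending on the behavior you need, you can mix index and rindex calls or go with one of the versions above (the difference is analogous to that between the regex groups (.*) and (.*?)).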
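
To make the regex comparison concrete, here is a small sketch on the same sample string. It assumes the delimiters should be matched literally, hence re.escape. The variable names lazy and greedy are only for illustration:

```python
import re

s = "123123STRINGabcabc"
first, last = "123", "abc"

# Non-greedy group: first `first`, then the nearest following `last`
# (the same result as find_between on this input).
lazy = re.search(re.escape(first) + "(.*?)" + re.escape(last), s).group(1)

# Greedy group: first `first`, then the last `last`
# (like using index for the start and rindex for the end).
greedy = re.search(re.escape(first) + "(.*)" + re.escape(last), s).group(1)

print(lazy)    # 123STRING
print(greedy)  # 123STRINGabc
```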

Here is one way to do it:

_, _, rest = s.partition(start)
result, _, _ = rest.partition(end)
print(result)
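
One thing to keep in mind is that partition does not raise when a separator is missing: it returns the whole string followed by two empty strings. The sketch below, with a hypothetical helper name between_or_none, checks the separator slot to tell "not found" apart from an empty result:

```python
def between_or_none(s, start, end):
    """Return the text between `start` and `end`, or None if either is missing."""
    _, sep1, rest = s.partition(start)
    if not sep1:  # `start` was not found
        return None
    result, sep2, _ = rest.partition(end)
    if not sep2:  # `end` was not found after `start`
        return None
    return result


print(between_or_none('asdf=5;iwantthis123jasd', 'asdf=5;', '123jasd'))  # iwantthis
print(between_or_none('no markers here', 'asdf=5;', '123jasd'))          # None
```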

Another way is using a regexp:

import re
print(re.findall(re.escape(start) + "(.*)" + re.escape(end), s)[0])

print(re.search(re.escape(start) + "(.*)" + re.escape(end), s).group(1))

import re


s = 'asdf=5;iwantthis123jasd'
result = re.search('asdf=5;(.*)123jasd', s)
print(result.group(1))

Here is something I posted earlier as a code snippet on Daniweb:

# picking up piece of string between separators
# function using partition, like partition, but drops the separators
def between(left, right, s):
    before, _, a = s.partition(left)
    a, _, after = a.partition(right)
    return before, a, after


s = "bla bla blaa <a>data</a> lsdjfasdjöf (important notice) 'Daniweb forum' tcha tcha tchaa"
print(between('<a>', '</a>', s))
print(between('(', ')', s))
print(between("'", "'", s))


""" Output (from Python 2, which shows 'ö' as the bytes '\xc3\xb6'):
('bla bla blaa ', 'data', " lsdjfasdj\xc3\xb6f (important notice) 'Daniweb forum' tcha tcha tchaa")
('bla bla blaa <a>data</a> lsdjfasdj\xc3\xb6f ', 'important notice', " 'Daniweb forum' tcha tcha tchaa")
('bla bla blaa <a>data</a> lsdjfasdj\xc3\xb6f (important notice) ', 'Daniweb forum', ' tcha tcha tchaa')
"""

String formatting adds some flexibility to what Nikolaus Gradwohl suggested. start and end can now be amended as desired.

import re


s = 'asdf=5;iwantthis123jasd'
start = 'asdf=5;'
end = '123jasd'


result = re.search('%s(.*)%s' % (start, end), s).group(1)
print(result)
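
One caveat when the pattern is built from arbitrary strings: characters such as ( or * are regex metacharacters. The following is a minimal sketch, assuming the markers are meant to be matched literally, that wraps them in re.escape. The sample string and markers are made up for illustration:

```python
import re

s = 'bla bla (important notice) bla'
start = '('
end = ')'

# Without re.escape, '(' and ')' would be read as group syntax
# instead of literal parentheses.
pattern = '%s(.*)%s' % (re.escape(start), re.escape(end))
result = re.search(pattern, s).group(1)
print(result)  # important notice
```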

To extract STRING, try:

myString = '123STRINGabc'
startString = '123'
endString = 'abc'


mySubString = myString[myString.find(startString)+len(startString):myString.find(endString)]

start = 'asdf=5;'
end = '123jasd'
s = 'asdf=5;iwantthis123jasd'
print(s[s.find(start)+len(start):s.rfind(end)])

gives

iwantthis

source = 'your token _here0@df and maybe _here1@df or maybe _here2@df'
start_sep = '_'
end_sep = '@df'
result = []
tmp = source.split(start_sep)
for par in tmp:
    if end_sep in par:
        result.append(par.split(end_sep)[0])

print(result)
This should print: ['here0', 'here1', 'here2']

A regex is better, but it requires importing the re module, and you may prefer to stick to plain string methods.
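
For comparison, here is a sketch of a regex counterpart to the loop above, using re from the standard library. The non-greedy group (.*?) keeps each match from running on to a later end_sep:

```python
import re

source = 'your token _here0@df and maybe _here1@df or maybe _here2@df'
start_sep = '_'
end_sep = '@df'

# re.findall returns the captured group for every non-overlapping match.
pattern = re.escape(start_sep) + '(.*?)' + re.escape(end_sep)
result = re.findall(pattern, source)
print(result)  # ['here0', 'here1', 'here2']
```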

Just converting the OP's own solution into an answer:

def find_between(s, start, end):
    return (s.split(start))[1].split(end)[0]

This is essentially cji's answer (Jul 30 '10 at 5:58). I changed the try/except structure to make it clearer what was causing the exception.

def find_between(inputStr, firstSubstr, lastSubstr):
    '''
    find between firstSubstr and lastSubstr in inputStr  STARTING FROM THE LEFT
    http://stackoverflow.com/questions/3368969/find-string-between-two-substrings
    above also has a func that does this FROM THE RIGHT
    '''
    start, end = (-1, -1)
    try:
        start = inputStr.index(firstSubstr) + len(firstSubstr)
    except ValueError as exc:
        print('    ValueError: ', "firstSubstr=%s  -  " % firstSubstr, exc)

    try:
        end = inputStr.index(lastSubstr, start)
    except ValueError as exc:
        print('    ValueError: ', "lastSubstr=%s  -  " % lastSubstr, exc)

    return inputStr[start:end]

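These solutions assume the start string and the end string are different. Here is a solution I use for an entire file when the initial and final indicators are the same, assuming the entire file is read using readlines():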

def extractstring(line, flag='$'):
    if flag in line:  # $ is the flag
        dex1 = line.index(flag)
        subline = line[dex1 + len(flag):]  # everything after the first flag
        dex2 = subline.index(flag)
        string = subline[0:dex2].strip()  # does not include last flag, strip whitespace
        return string

Example:

lines = ['asdf 1qr3 qtqay 45q at $A NEWT?$ asdfa afeasd',
         'afafoaltat $I GOT BETTER!$ derpity derp derp']
for line in lines:
    string = extractstring(line, flag='$')
    print(string)

Gives:

A NEWT?
I GOT BETTER!
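
If a line can hold more than one flagged section, one possible variation is to split on the flag and keep the pieces at odd positions. This is only a sketch, and the function name extract_all is hypothetical:

```python
def extract_all(line, flag='$'):
    """Return every piece of `line` enclosed by a pair of `flag` markers."""
    pieces = line.split(flag)
    # Pieces at odd positions sit between an opening and a closing flag.
    # Stopping before the last piece ignores text after an unpaired flag.
    return [piece.strip() for piece in pieces[1:-1:2]]


print(extract_all('at $A NEWT?$ asdfa $I GOT BETTER!$ derp'))
# ['A NEWT?', 'I GOT BETTER!']
```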

You can simply use this code or copy the function below. It all fits neatly on one line.

def substring(whole, sub1, sub2):
    return whole[whole.index(sub1) : whole.index(sub2)]

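If you run the function as follows:

print(substring("5+(5*2)+2", "(", ")"))

you will get the output:

(5*2

instead of

5*2

If you want the closing substring included at the end of the output, the code must look like this:

    return whole[whole.index(sub1) : whole.index(sub2) + 1]

But if you don't want the substrings in the output, the +1 must go on the first value instead (more generally, + len(sub1) for a delimiter longer than one character):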

    return whole[whole.index(sub1) + 1 : whole.index(sub2)]

from timeit import timeit
from re import search, DOTALL


def partition_find(string, start, end):
    return string.partition(start)[2].rpartition(end)[0]


def re_find(string, start, end):
    # applying re.escape to start and end would be safer
    return search(start + '(.*)' + end, string, DOTALL).group(1)


def index_find(string, start, end):
    return string[string.find(start) + len(start):string.rfind(end)]


# The wikitext of the "Alan Turing law" article from English Wikipedia
# https://en.wikipedia.org/w/index.php?title=Alan_Turing_law&action=edit&oldid=763725886
string = """..."""
start = '==Proposals=='
end = '==Rival bills=='

assert index_find(string, start, end) \
    == partition_find(string, start, end) \
    == re_find(string, start, end)

print('index_find', timeit(
    'index_find(string, start, end)',
    globals=globals(),
    number=100_000,
))

print('partition_find', timeit(
    'partition_find(string, start, end)',
    globals=globals(),
    number=100_000,
))

print('re_find', timeit(
    're_find(string, start, end)',
    globals=globals(),
    number=100_000,
))

Results:

index_find 0.35047444528454114
partition_find 0.5327825636197754
re_find 7.552149639286381

In this example, re_find was more than 20 times slower than index_find.

This seems much more straightforward to me:

import re


s = 'asdf=5;iwantthis123jasd'
x= re.search('iwantthis',s)
print(s[x.start():x.end()])

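Parsing text with delimiters from different email platforms posed a larger version of this problem. The delimiters generally come as a START and a STOP. Delimiter characters that double as regex wildcards kept choking the regex. The problem with split() is mentioned here and elsewhere: the delimiter characters are gone afterwards. It occurred to me to use replace() to give split() something else to consume. Chunk of code: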

import logging

nuke = '~~~'  # throwaway marker; must not occur in the text itself
start = '|*'
stop = '*|'
# textIn is the raw input text
julien = (textIn.replace(start, nuke + start).replace(stop, stop + nuke).split(nuke))
keep = [chunk for chunk in julien if start in chunk and stop in chunk]
logging.info('keep: %s', keep)
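
An alternative sketch keeps the regex but escapes the delimiters with re.escape, so that | and * are treated as literal characters. The sample text and the name text_in are assumptions for illustration only:

```python
import re

start = '|*'
stop = '*|'
text_in = 'Hi |*FNAME*|, your code is |*CODE*|.'  # made-up sample input

# Match each delimited chunk, delimiters included.
# re.DOTALL lets '.' span newlines in multi-line text.
pattern = re.escape(start) + '.*?' + re.escape(stop)
keep = re.findall(pattern, text_in, re.DOTALL)
print(keep)  # ['|*FNAME*|', '|*CODE*|']
```

On this input it yields the same chunks as the replace() and split() approach, without needing a marker string that is guaranteed not to appear in the text.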

Below is a function I wrote to return a list of the string(s) found between string1 and string2.

def GetListOfSubstrings(stringSubject, string1, string2):
    MyList = []
    intstart = 0
    strlength = len(stringSubject)
    continueloop = 1

    while (intstart < strlength and continueloop == 1):
        intindex1 = stringSubject.find(string1, intstart)
        if (intindex1 != -1):  # The substring was found, lets proceed
            intindex1 = intindex1 + len(string1)
            intindex2 = stringSubject.find(string2, intindex1)
            if (intindex2 != -1):
                subsequence = stringSubject[intindex1:intindex2]
                MyList.append(subsequence)
                intstart = intindex2 + len(string2)
            else:
                continueloop = 0
        else:
            continueloop = 0
    return MyList


# Usage Example
mystring = "s123y123o123pp123y6"
List = GetListOfSubstrings(mystring, "1", "y68")
for x in range(0, len(List)):
    print(List[x])
# output: (nothing is printed, because "y68" never occurs)


mystring = "s123y123o123pp123y6"
List = GetListOfSubstrings(mystring, "1", "3")
for x in range(0, len(List)):
    print(List[x])
# output:
# 2
# 2
# 2
# 2


mystring = "s123y123o123pp123y6"
List = GetListOfSubstrings(mystring, "1", "y")
for x in range(0, len(List)):
    print(List[x])
# output:
# 23
# 23o123pp123

Further to Nikolaus Gradwohl's answer, I needed to get the version number (i.e., 0.0.2) between 'ui:' and '-' from the file content below (filename: docker-compose.yml):

    version: '3.1'
    services:
      ui:
        image: repo-pkg.dev.io:21/website/ui:0.0.2-QA1
        #network_mode: host
        ports:
          - 443:9999
        ulimits:
          nofile:test

This is how it worked for me (Python script):

import re

with open('docker-compose.yml', 'r') as f:
    lines = f.read()
result = re.search('ui:(.*)-', lines)
print(result.group(1))




Result:
0.0.2
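
The greedy (.*) works here because there is only one '-' after 'ui:' on that line. As a hedged variation, the sketch below uses a character class that cannot run past the first '-' and handles the no-match case. The compose text is embedded as a string so the example is self-contained, and the names compose_text and version are illustrative:

```python
import re

# Sample content copied from the docker-compose.yml shown above.
compose_text = """\
version: '3.1'
services:
  ui:
    image: repo-pkg.dev.io:21/website/ui:0.0.2-QA1
"""

# [^-\s]+ matches a run of characters that are neither '-' nor whitespace,
# so the bare "ui:" service key (followed by a newline) is skipped.
match = re.search(r'ui:([^-\s]+)-', compose_text)
version = match.group(1) if match else None
print(version)  # 0.0.2
```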

If you don't want to import anything, try the string method .index():

text = 'I want to find a string between two substrings'
left = 'find a '
right = 'between two'


# Output: 'string'
print(text[text.index(left)+len(left):text.index(right)])