In Python, how do I split a string and keep the separators?

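Here's the simplest way to explain this. Here's what I'm using:

re.split(r'\W', 'foo/bar spam\neggs')
>>> ['foo', 'bar', 'spam', 'eggs']

Here's what I want:

someMethod(r'\W', 'foo/bar spam\neggs')
>>> ['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']

The reason is that I want to split a string into tokens, manipulate them, then put them back together again.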


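The documentation of re.split mentions:

Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.

So you just need to wrap your separator in a capturing group:

>>> re.split(r'(\W)', 'foo/bar spam\neggs')
['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']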
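Because the capturing group keeps every separator, joining the pieces reproduces the input exactly, which is what makes the tokenize, modify, reassemble workflow from the question possible. A minimal sketch, assuming the goal is to upper-case only the word tokens (the names `tokens` and `rebuilt` are illustrative and not part of the original answer):

```python
import re

s = 'foo/bar spam\neggs'

# The capturing group makes re.split return the separators as list items.
tokens = re.split(r'(\W)', s)

# Joining the untouched tokens reproduces the original string.
assert ''.join(tokens) == s

# With a single capturing group, separators sit at the odd indices,
# so the words (even indices) can be modified independently.
rebuilt = ''.join(
    tok.upper() if i % 2 == 0 else tok
    for i, tok in enumerate(tokens)
)
print(repr(rebuilt))  # 'FOO/BAR SPAM\nEGGS'
```

The even/odd layout only holds when the pattern has exactly one capturing group; with more groups, each match contributes one list item per group.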

You can also split a string with a list of strings instead of a regular expression, like this:

def tokenizeString(aString, separators):
    # separators is a list of strings that are used to split the string.
    # Sort ascending by length: the loop below keeps the last match,
    # so the longest matching separator wins.
    separators = sorted(separators, key=len)
    listToReturn = []
    i = 0
    while i < len(aString):
        theSeparator = ""
        for current in separators:
            if current == aString[i:i+len(current)]:
                theSeparator = current
        if theSeparator != "":
            listToReturn += [theSeparator]
            i = i + len(theSeparator)
        else:
            if listToReturn == []:
                listToReturn = [""]
            if listToReturn[-1] in separators:
                listToReturn += [""]
            listToReturn[-1] += aString[i]
            i += 1
    return listToReturn


print(tokenizeString(aString = "\"\"\"hi\"\"\" hello + world += (1*2+3/5) '''hi'''", separators = ["'''", '+=', '+', "/", "*", "\\'", '\\"', "-=", "-", " ", '"""', "(", ")"]))

# This keeps all separators in the result
##########################################################################
import re
st = "%%(c+dd+e+f-1523)%%7"
sh = re.compile(r'[+\-/*<>%()]')


def splitStringFull(sh, st):
    ls = sh.split(st)
    lo = []
    start = 0
    for l in ls:
        if not l:
            continue
        # Search from the current position so repeated tokens are located correctly
        k = st.find(l, start)
        if k > start:
            # The separators sitting between the previous token and this one
            lo.append(st[start:k])
        lo.append(l)
        start = k + len(l)
    if start < len(st):
        # Trailing separators after the last token
        lo.append(st[start:])
    return lo
#############################


li = splitStringFull(sh, st)
print(li)
# ['%%(', 'c', '+', 'dd', '+', 'e', '+', 'f', '-', '1523', ')%%', '7']

Another no-regex solution that works well on Python 3

# Split strings and keep separator
test_strings = ['<Hello>', 'Hi', '<Hi> <Planet>', '<', '']


def split_and_keep(s, sep):
    if not s:
        return ['']  # consistent with string.split()

    # Find replacement character that is not used in string
    # i.e. just use the highest available character plus one
    # Note: This fails if ord(max(s)) = 0x10FFFF (ValueError)
    p = chr(ord(max(s)) + 1)

    return s.replace(sep, sep + p).split(p)


for s in test_strings:
    print(split_and_keep(s, '<'))


# If the unicode limit is reached it will fail explicitly
unicode_max_char = chr(1114111)
ridiculous_string = '<Hello>' + unicode_max_char + '<World>'
print(split_and_keep(ridiculous_string, '<'))

If you are splitting on newlines, use splitlines(True).

>>> 'line 1\nline 2\nline without newline'.splitlines(True)
['line 1\n', 'line 2\n', 'line without newline']

(Not a general solution, but adding this here in case someone comes here not realizing this method exists.)
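As a small illustration of the splitlines approach, the sketch below uses the keyword form `keepends=True` and a sample string with mixed line endings (the sample text is an assumption chosen for the demo). It shows that the terminators stay attached, so the pieces can be joined back losslessly:

```python
text = 'line 1\r\nline 2\nline without newline'

# keepends=True leaves each line terminator attached to its line.
lines = text.splitlines(keepends=True)
print(lines)  # ['line 1\r\n', 'line 2\n', 'line without newline']

# Nothing is lost, so joining restores the original text.
assert ''.join(lines) == text
```

Note that splitlines treats more than '\n' as a line boundary ('\r\n' and '\r' among others), which may or may not be what you want.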

If you want to split a string while keeping the separators by regex, without a capturing group:

import re


def finditer_with_separators(regex, s):
    matches = []
    prev_end = 0
    for match in regex.finditer(s):
        match_start = match.start()
        if (prev_end != 0 or match_start > 0) and match_start != prev_end:
            matches.append(s[prev_end:match.start()])
        matches.append(match.group())
        prev_end = match.end()
    if prev_end < len(s):
        matches.append(s[prev_end:])
    return matches


s = "f(x) + g(y)"  # example input; s was left undefined in the original answer
regex = re.compile(r"[\(\)]")
matches = finditer_with_separators(regex, s)

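If one assumes that the regex is wrapped in a capturing group:

def split_with_separators(regex, s):
    matches = list(filter(None, regex.split(s)))
    return matches


regex = re.compile(r"([\(\)])")
matches = split_with_separators(regex, s)

Both approaches also drop empty strings from the result, which are useless and annoying in most cases.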
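To see where those empty strings come from, here is a short sketch (the sample string `'(a)(b)'` is an assumption chosen so that separators touch each other and the string boundaries). re.split emits an empty string wherever two separators are adjacent or a separator sits at either end:

```python
import re

regex = re.compile(r'([()])')
s = '(a)(b)'

raw = regex.split(s)
print(raw)  # ['', '(', 'a', ')', '', '(', 'b', ')', '']

# Dropping the empty strings keeps every character of the input.
cleaned = [piece for piece in raw if piece]
print(cleaned)  # ['(', 'a', ')', '(', 'b', ')']
assert ''.join(cleaned) == s
```

Filtering is safe for reassembly because empty strings contribute nothing to the join, but it does discard the information that a field was empty.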

If you have only one separator, you can use list comprehensions:

text = 'foo,bar,baz,qux'
sep = ','

Appending/prepending the separator:

result = [x+sep for x in text.split(sep)]
#['foo,', 'bar,', 'baz,', 'qux,']
# to get rid of the trailing separator
result[-1] = result[-1].strip(sep)
#['foo,', 'bar,', 'baz,', 'qux']


result = [sep+x for x in text.split(sep)]
#[',foo', ',bar', ',baz', ',qux']
# to get rid of the leading separator
result[0] = result[0].strip(sep)
#['foo', ',bar', ',baz', ',qux']

Separator as its own element:

result = [u for x in text.split(sep) for u in (x, sep)]
#['foo', ',', 'bar', ',', 'baz', ',', 'qux', ',']
result = result[:-1]   # to get rid of the trailing separator

Another example: split on non-alphanumeric characters and keep the separators

import re
a = "foo,bar@candy*ice%cream"
re.split(r'([^a-zA-Z0-9])', a)

Output:

['foo', ',', 'bar', '@', 'candy', '*', 'ice', '%', 'cream']

Explanation

re.split(r'([^a-zA-Z0-9])', a)


() <- capturing group: keep the separators
[] <- character class: match any single character listed inside
^a-zA-Z0-9 <- negation: anything except letters (upper/lower case) and digits
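One practical difference between this character class and the `\W` used earlier in the thread is the underscore: `\W` treats `_` as a word character, while `[^a-zA-Z0-9]` treats it as a separator. A quick comparison, with a sample string chosen for the demo:

```python
import re

a = 'snake_case-name'

# \W does not match '_', so the underscore stays inside the token.
with_w = re.split(r'(\W)', a)
print(with_w)  # ['snake_case', '-', 'name']

# The explicit class excludes only ASCII letters and digits, so '_' splits too.
with_class = re.split(r'([^a-zA-Z0-9])', a)
print(with_class)  # ['snake', '_', 'case', '-', 'name']
```

Both variants keep every character, so either result can be joined back into the original string.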

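A lazy and simple solution

Assume your regex pattern is split_pattern = r'(!|\?)'

First, insert a marker string that will act as the new separator, such as '[cut]', after every match:

new_string = re.sub(split_pattern, '\\1[cut]', your_string)

Then split on the new separator: new_string.split('[cut]')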
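Put together, the two steps might look like the sketch below. The sample sentence and the `marker` variable are assumptions for illustration; the approach only works if the marker never occurs in the input, which the assertion checks:

```python
import re

split_pattern = r'(!|\?)'
your_string = 'Hi! How are you? Fine'   # hypothetical sample input
marker = '[cut]'                        # must not occur in the input

assert marker not in your_string

# Step 1: re-emit each matched separator followed by the marker.
new_string = re.sub(split_pattern, r'\1' + marker, your_string)

# Step 2: split on the marker; the separators stay attached to the left part.
parts = new_string.split(marker)
print(parts)  # ['Hi!', ' How are you?', ' Fine']
```

If the input ends with a separator, the final element of the result is an empty string.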

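I had a similar issue trying to split a file path and struggled to find a simple answer. This worked for me and didn't involve having to substitute delimiters back into the split text: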

my_path = 'folder1/folder2/folder3/file1'

import re

re.findall('[^/]+/|[^/]+', my_path)

Returns:

['folder1/', 'folder2/', 'folder3/', 'file1']
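A different technique that gives the same result for this path is to split on a zero-width lookbehind rather than to collect matches with findall. This is a sketch of an alternative, not what the answer above uses, and it assumes Python 3.7 or later, where re.split accepts patterns that can match an empty string:

```python
import re

my_path = 'folder1/folder2/folder3/file1'

# (?<=/) matches the empty position right after each slash,
# so the slash stays attached to the component on its left.
parts = re.split(r'(?<=/)', my_path)
print(parts)  # ['folder1/', 'folder2/', 'folder3/', 'file1']
```

Unlike the findall version, a path that ends in a slash produces a trailing empty string here.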

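I found this generator-based approach more satisfying:

def split_keep(string, sep):
    """Usage:
    >>> list(split_keep("a.b.c.d", "."))
    ['a.', 'b.', 'c.', 'd']
    """
    if not sep:
        raise ValueError("empty separator")
    start = 0
    while True:
        idx = string.find(sep, start)
        if idx == -1:
            break
        # Advance past the whole separator so multi-character separators work
        end = idx + len(sep)
        yield string[start:end]
        start = end
    yield string[start:]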

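It avoids the need to figure out the correct regex, and in theory it should be fairly cheap. Apart from the slices it yields, it creates no intermediate strings or lists, and it delegates most of the iteration work to the efficient find method.

... and in Python 3.8 it can be as short as: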

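def split_keep(string, sep):
    # Assumes sep is a non-empty string
    start = 0
    while (idx := string.find(sep, start)) != -1:
        end = idx + len(sep)  # skip the whole separator, not just one character
        yield string[start:end]
        start = end
    yield string[start:]

1. Replace every separator (\W) with separator + new_separator (\W;)

2. Split on new_separator (;)

import re


def split_and_keep(separator, s):
    # Assumes s does not already contain ';'
    return re.split(';', re.sub(separator, lambda match: match.group() + ';', s))


print(split_and_keep(r'\W', 'foo/bar spam\neggs'))
# ['foo/', 'bar ', 'spam\n', 'eggs']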

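Here is a simple .split solution that works without regex.

This is an answer for "Python split() without removing the delimiter", so it is not exactly what the original post asks, but the other question was closed as a duplicate of this one.

def splitkeep(s, delimiter):
    split = s.split(delimiter)
    return [substr + delimiter for substr in split[:-1]] + [split[-1]]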

Random tests:

import random


CHARS = [".", "a", "b", "c"]
assert splitkeep("", "X") == [""]  # 0 length test
for delimiter in ('.', '..'):
    for _ in range(100000):
        length = random.randint(1, 50)
        s = "".join(random.choice(CHARS) for _ in range(length))
        assert "".join(splitkeep(s, delimiter)) == s

May I leave this here:

s = 'foo/bar spam\neggs'
print(s.replace('/', '+++/+++').replace(' ', '+++ +++').replace('\n', '+++\n+++').split('+++'))


['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']

If you use re.split, your regex comes from a variable, and you have multiple separators, you can do the following:

import re

# BashSpecialParamList holds the bash special parameters,
# which serve as the separators in this example
BashSpecialParamList = ["$*", "$@", "$#", "$?", "$-", "$$", "$!", "$0"]
# aStr is the string to be split
aStr = "$a Klkjfd$0 $? $#%$*Sdfdf"

# Escape each separator so it is matched literally, then join them with "|"
reStr = "|".join([re.escape(sepStr) for sepStr in BashSpecialParamList])

re.split(f'({reStr})', aStr)

# Then you get the result:
# ['$a Klkjfd', '$0', ' ', '$?', ' ', '$#', '%', '$*', 'Sdfdf']

Reference: GNU Bash Special Parameters
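One caveat when building an alternation from a list: the regex engine tries alternatives from left to right, so a short separator listed before a longer one that starts with it will win. None of the bash parameters above is a prefix of another, so the example is unaffected, but with operator-like separators the order matters. A sketch with a hypothetical separator list, sorting longest first:

```python
import re

separators = ['+', '+=', '-', '-=']   # hypothetical list; '+' is a prefix of '+='
text = 'a+=b-c'

# Naive order: '+' is tried before '+=', so '+=' is split in two.
naive = '|'.join(re.escape(sep) for sep in separators)
naive_result = re.split(f'({naive})', text)
print(naive_result)  # ['a', '+', '=b', '-', 'c']

# Longest first: multi-character separators win over their prefixes.
by_length = sorted(separators, key=len, reverse=True)
ordered = '|'.join(re.escape(sep) for sep in by_length)
ordered_result = re.split(f'({ordered})', text)
print(ordered_result)  # ['a', '+=', 'b', '-', 'c']
```

Both results still join back to the original text; only the token boundaries differ.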

Install wrs ("without removing splitor") by running

pip install wrs

(developed by Rao Hamza)

import wrs
text  = "Now inbox “how to make spam ad” Invest in hard email marketing."
splitor = 'email | spam | inbox'
list = wrs.wr_split(splitor, text)
print(list)

Result: ['now ', 'inbox “how to make ', 'spam ad” invest in hard ', 'email marketing.']

Some of the answers posted earlier repeat the delimiter, or have other bugs that I ran into in my own case. You can use this function instead:

def split_and_keep_delimiter(input, delimiter):
    # delimiter must be a non-empty string
    result = list()
    while delimiter in input:
        idx = input.index(delimiter)
        result.append(input[0:idx+len(delimiter)])
        input = input[idx+len(delimiter):]
    result.append(input)
    return result

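Below is a simple, efficient, and tested answer to this question. The code has comments explaining everything in it.

I promise it's not as scary as it looks: it's actually only 13 lines of code! The rest is comments, docs, and assertions.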

def split_including_delimiters(input: str, delimiter: str):
    """
    Splits an input string, while including the delimiters in the output

    Unlike str.split, we can use an empty string as a delimiter
    Unlike str.split, the output will not have any extra empty strings
    Consequently, len(split_including_delimiters('', delimiter)) == 0 for all delimiters,
    whereas len(input.split(delimiter)) > 0 for all inputs and non-empty delimiters

    INPUTS:
        input: Can be any string
        delimiter: Can be any string

    EXAMPLES:
        >>> split_including_delimiters('Hello World  ! ', ' ')
        ['Hello', ' ', 'World', ' ', ' ', '!', ' ']
        >>> split_including_delimiters("Hello**World**!***", "**")
        ['Hello', '**', 'World', '**', '!', '**', '*']
    EXAMPLES:
        assert split_including_delimiters('-xx-xx-','xx') == ['-', 'xx', '-', 'xx', '-'] # length 5
        assert split_including_delimiters('xx-xx-' ,'xx') == ['xx', '-', 'xx', '-']      # length 4
        assert split_including_delimiters('-xx-xx' ,'xx') == ['-', 'xx', '-', 'xx']      # length 4
        assert split_including_delimiters('xx-xx'  ,'xx') == ['xx', '-', 'xx']           # length 3
        assert split_including_delimiters('xxxx'   ,'xx') == ['xx', 'xx']                # length 2
        assert split_including_delimiters('xxx'    ,'xx') == ['xx', 'x']                 # length 2
        assert split_including_delimiters('x'      ,'xx') == ['x']                       # length 1
        assert split_including_delimiters(''       ,'xx') == []                          # length 0
        assert split_including_delimiters('aaa'    ,'xx') == ['aaa']                     # length 1
        assert split_including_delimiters('aa'     ,'xx') == ['aa']                      # length 1
        assert split_including_delimiters('a'      ,'xx') == ['a']                       # length 1
        assert split_including_delimiters(''       ,''  ) == []                          # length 0
        assert split_including_delimiters('a'      ,''  ) == ['a']                       # length 1
        assert split_including_delimiters('aa'     ,''  ) == ['a', '', 'a']              # length 3
        assert split_including_delimiters('aaa'    ,''  ) == ['a', '', 'a', '', 'a']     # length 5
    """

    # Input assertions
    assert isinstance(input, str), "input must be a string"
    assert isinstance(delimiter, str), "delimiter must be a string"

    if delimiter:
        # These tokens do not include the delimiter, but are computed quickly
        tokens = input.split(delimiter)
    else:
        # Edge case: if the delimiter is the empty string, split between the characters
        tokens = list(input)

    # The following assertion is always true for any string input and delimiter
    # For speed's sake, we disable this assertion
    # assert delimiter.join(tokens) == input

    output = tokens[:1]

    for token in tokens[1:]:
        output.append(delimiter)
        if token:
            output.append(token)

    # Don't let the first element be an empty string
    if output[:1] == ['']:
        del output[0]

    # The only case where we should have an empty string in the output is if it is our delimiter
    # For speed's sake, we disable this assertion
    # assert delimiter=='' or '' not in output

    # The resulting strings should be combinable back into the original string
    # For speed's sake, we disable this assertion
    # assert ''.join(output) == input

    return output

>>> line = 'hello_toto_is_there'
>>> sep = '_'
>>> [sep + x[1] if x[0] != 0 else x[1] for x in enumerate(line.split(sep))]
['hello', '_toto', '_is', '_there']