Split a string at uppercase letters

What is the Pythonic way to split a string before the occurrences of a given set of characters?

For example, I want to split 'TheLongAndWindingRoad' at any occurrence of an uppercase letter (possibly except the first), and obtain ['The', 'Long', 'And', 'Winding', 'Road'].

Edit: It should also split single occurrences, i.e. from 'ABC' I'd like to obtain ['A', 'B', 'C'].


Unfortunately, Python versions before 3.7 cannot split on a zero-width match, but you can use re.findall instead:

>>> import re
>>> re.findall('[A-Z][^A-Z]*', 'TheLongAndWindingRoad')
['The', 'Long', 'And', 'Winding', 'Road']
>>> re.findall('[A-Z][^A-Z]*', 'ABC')
['A', 'B', 'C']

import re
list(filter(None, re.split("([A-Z][^A-Z]*)", "TheLongAndWindingRoad")))

or

[s for s in re.split("([A-Z][^A-Z]*)", "TheLongAndWindingRoad") if s]

>>> import re
>>> re.findall('[A-Z][a-z]*', 'TheLongAndWindingRoad')
['The', 'Long', 'And', 'Winding', 'Road']


>>> re.findall('[A-Z][a-z]*', 'SplitAString')
['Split', 'A', 'String']


>>> re.findall('[A-Z][a-z]*', 'ABC')
['A', 'B', 'C']

If you want "It'sATest" to split into ["It's", 'A', 'Test'], change the regex to "[A-Z][a-z']*"
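
To make the effect of the character class concrete, here is a small sketch comparing the two patterns. The helper name `split_words` and its `extra` parameter are invented for this illustration and are not part of the answer above.

```python
import re

def split_words(text, extra=""):
    # Hypothetical helper: `extra` lists additional characters that may
    # continue a word after its leading uppercase letter.
    pattern = "[A-Z][a-z" + re.escape(extra) + "]*"
    return re.findall(pattern, text)

print(split_words("It'sATest"))       # ['It', 'A', 'Test']
print(split_words("It'sATest", "'"))  # ["It's", 'A', 'Test']
```

Without the apostrophe in the class, both the apostrophe and the following "s" are dropped, because neither can start a new match.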

Alternative solution (if you dislike explicit regexes):

s = 'TheLongAndWindingRoad'

# indices of all uppercase letters
pos = [i for i, e in enumerate(s) if e.isupper()]

parts = []
for j in range(len(pos)):
    try:
        parts.append(s[pos[j]:pos[j+1]])
    except IndexError:
        # last uppercase letter: take everything to the end
        parts.append(s[pos[j]:])

print(parts)

A variation on @ChristopheD's solution:

s = 'TheLongAndWindingRoad'

# the appended 'A' is a sentinel that marks the end of the last piece
pos = [i for i, e in enumerate(s + 'A') if e.isupper()]
parts = [s[pos[j]:pos[j+1]] for j in range(len(pos) - 1)]

print(parts)

Here is another regex solution. The problem can be rephrased as "how do I insert a space before each uppercase letter, and then split":

>>> s = "TheLongAndWindingRoad ABC A123B45"
>>> re.sub( r"([A-Z])", r" \1", s).split()
['The', 'Long', 'And', 'Winding', 'Road', 'A', 'B', 'C', 'A123', 'B45']

This has the advantage of preserving all non-whitespace characters, which most of the other solutions do not.
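
A short side-by-side sketch of that claim, reusing the sample string from above. The variable names `via_sub` and `via_findall` are only for illustration.

```python
import re

s = "TheLongAndWindingRoad ABC A123B45"

# Insert a space before each uppercase letter, then split on whitespace.
via_sub = re.sub(r"([A-Z])", r" \1", s).split()

# findall keeps only what the pattern matches, so the digits are lost.
via_findall = re.findall(r"[A-Z][a-z]*", s)

print(via_sub)      # ends with 'A123', 'B45'
print(via_findall)  # ends with 'A', 'B'
```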

src = 'TheLongAndWindingRoad'
glue = ' '


result = ''.join(glue + x if x.isupper() else x for x in src).strip(glue).split(glue)

An alternative that uses neither regex nor enumerate:

word = 'TheLongAndWindingRoad'
chars = list(word)

# prefix every uppercase letter except the first character with a space
for i in range(1, len(chars)):
    if chars[i].isupper():
        chars[i] = ' ' + chars[i]

fin_list = ''.join(chars).split(' ')

I think this is clearer and simpler: it does not need many methods or a long list comprehension, which can be hard to read.

An alternative using enumerate and isupper()

Code:

strs = 'TheLongAndWindingRoad'
ind = 0
new_lst = []
# start at index 1 so the first letter never triggers a split
for index, val in enumerate(strs[1:], 1):
    if val.isupper():
        new_lst.append(strs[ind:index])
        ind = index
if ind < len(strs):
    new_lst.append(strs[ind:])
print(new_lst)

Output:

['The', 'Long', 'And', 'Winding', 'Road']

Another one without regex, with the ability to keep contiguous uppercase letters together if wanted:

def split_on_uppercase(s, keep_contiguous=False):
    """Split a string before each uppercase letter.

    Args:
        s (str): string
        keep_contiguous (bool): flag to indicate we want to
                                keep contiguous uppercase chars together

    Returns:
        list of str: the pieces of s
    """

    string_length = len(s)
    # True if the character before or after position i is lowercase
    is_lower_around = (lambda: s[i-1].islower() or
                       string_length > (i + 1) and s[i + 1].islower())

    start = 0
    parts = []
    for i in range(1, string_length):
        if s[i].isupper() and (not keep_contiguous or is_lower_around()):
            parts.append(s[start: i])
            start = i
    parts.append(s[start:])

    return parts



>>> split_on_uppercase('theLongWindingRoad')
['the', 'Long', 'Winding', 'Road']
>>> split_on_uppercase('TheLongWindingRoad')
['The', 'Long', 'Winding', 'Road']
>>> split_on_uppercase('TheLongWINDINGRoadT', True)
['The', 'Long', 'WINDING', 'Road', 'T']
>>> split_on_uppercase('ABC')
['A', 'B', 'C']
>>> split_on_uppercase('ABCD', True)
['ABCD']
>>> split_on_uppercase('')
['']
>>> split_on_uppercase('hello world')
['hello world']

You can do this with the more_itertools.split_before tool.

import more_itertools as mit

iterable = "TheLongAndWindingRoad"
["".join(i) for i in mit.split_before(iterable, pred=lambda s: s.isupper())]
# ['The', 'Long', 'And', 'Winding', 'Road']

It should also split single occurrences, i.e. from 'ABC' I'd like to obtain ['A', 'B', 'C'].

iterable = "ABC"
["".join(i) for i in mit.split_before(iterable, pred=lambda s: s.isupper())]
# ['A', 'B', 'C']

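more_itertools is a third-party package with 60+ useful tools, including implementations of all the original itertools recipes, which saves you from implementing them by hand.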
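
If installing a third-party package is not an option, the same idea can be sketched with the standard library only. The generator below is a simplified stand-in written for this illustration, not the actual more_itertools implementation, and `split_upper` is a hypothetical wrapper name.

```python
def split_before(iterable, pred):
    # Simplified stand-in for more_itertools.split_before: start a new
    # chunk in front of every item for which pred(item) is true.
    chunk = []
    for item in iterable:
        if pred(item) and chunk:
            yield chunk
            chunk = []
        chunk.append(item)
    if chunk:
        yield chunk

def split_upper(text):
    # Join each chunk of characters back into a string.
    return ["".join(chunk) for chunk in split_before(text, str.isupper)]

print(split_upper("TheLongAndWindingRoad"))
print(split_upper("ABC"))
```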

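Replace every uppercase letter 'L' in the given string with a space followed by that letter, ' L'. We can do this with a list comprehension, or we can define a function for it.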

s = 'TheLongANDWindingRoad ABC A123B45'
''.join([char if (char.islower() or not char.isalpha()) else ' ' + char for char in s]).strip().split()
# ['The', 'Long', 'A', 'N', 'D', 'Winding', 'Road', 'A', 'B', 'C', 'A123', 'B45']

If you prefer a function, here is how:

def splitAtUpperCase(text):
    result = ""
    for char in text:
        if char.isupper():
            result += " " + char
        else:
            result += char
    return result.split()

For example:

print(splitAtUpperCase('TheLongANDWindingRoad'))
# ['The', 'Long', 'A', 'N', 'D', 'Winding', 'Road']

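Most of the time, though, when we split a sentence at uppercase letters we want to keep abbreviations intact, and these are typically a continuous run of uppercase letters. The code below helps with that.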

def splitAtUpperCase(s):
    # walk backwards so inserted spaces do not shift the indices still to visit
    for i in range(len(s)-1)[::-1]:
        if s[i].isupper() and s[i+1].islower():
            s = s[:i]+' '+s[i:]
        if s[i].isupper() and s[i-1].islower():
            s = s[:i]+' '+s[i:]
    return s.split()

splitAtUpperCase('TheLongANDWindingRoad')
# ['The', 'Long', 'AND', 'Winding', 'Road']
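
For comparison, the same "keep abbreviations together" behaviour can be sketched with a single regular expression. This is a swapped-in technique rather than the approach of the answer above; the function name `split_keep_acronyms` is hypothetical, and the pattern assumes ASCII letters and drops any leading lowercase text.

```python
import re

# First alternative: a run of uppercase letters that is not followed by a
# lowercase letter (an acronym). Second alternative: a capitalized word.
_WORDS = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z]*")

def split_keep_acronyms(text):
    return _WORDS.findall(text)

print(split_keep_acronyms("TheLongANDWindingRoad"))
# ['The', 'Long', 'AND', 'Winding', 'Road']
```

The negative lookahead makes the uppercase run give back its last letter when that letter starts a capitalized word, which is why "ANDW" is split as "AND" and "Winding".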

Thanks.

Sharing what came to mind when I read this post. It differs from the other answers.

strs = 'TheLongAndWindingRoad'

# grab index of uppercase letters in strs
start_idx = [i for i, j in enumerate(strs) if j.isupper()]

# create empty list
strs_list = []

# initiate counter
cnt = 1

for pos in start_idx:
    start_pos = pos

    # use counter to grab the next positional element; on IndexError the
    # last piece runs to the end of the string
    try:
        end_pos = start_idx[cnt]
    except IndexError:
        end_pos = len(strs)

    # append to empty list
    strs_list.append(strs[start_pos:end_pos])

    cnt += 1

Use a lookahead and a lookbehind:

In Python 3.7 and later, you can do this:

re.split('(?<=.)(?=[A-Z])', 'TheLongAndWindingRoad')

And it yields:

['The', 'Long', 'And', 'Winding', 'Road']

You need the lookbehind to avoid an empty string at the beginning.
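
A minimal sketch of what the lookbehind changes, assuming Python 3.7 or later (earlier versions do not split on empty matches). The variable names are only for illustration.

```python
import re

text = "TheLongAndWindingRoad"

# Lookahead only: the empty match at position 0 also counts as a split point.
without_lookbehind = re.split(r"(?=[A-Z])", text)

# The lookbehind requires a preceding character, so position 0 is skipped.
with_lookbehind = re.split(r"(?<=.)(?=[A-Z])", text)

print(without_lookbehind)  # starts with an empty string
print(with_lookbehind)     # starts with 'The'
```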

A Pythonic way could be:

"".join([(" "+i if i.isupper() else i) for i in 'TheLongAndWindingRoad']).strip().split()
['The', 'Long', 'And', 'Winding', 'Road']

Works well for Unicode and avoids re/re2.

"".join([(" "+i if i.isupper() else i) for i in 'СуперМаркетыПродажаКлиент']).strip().split()
['Супер', 'Маркеты', 'Продажа', 'Клиент']
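
To illustrate why Unicode matters here, the sketch below wraps the join/split expression in a function (the name `split_upper_unicode` is invented for this example) and contrasts it with an ASCII-only character class. Note that it also splits on any whitespace already present in the input.

```python
import re

def split_upper_unicode(text):
    # str.isupper() is Unicode-aware, so this works for any cased script.
    spaced = "".join(" " + ch if ch.isupper() else ch for ch in text)
    return spaced.split()

text = "СуперМаркетыПродажаКлиент"
print(split_upper_unicode(text))          # ['Супер', 'Маркеты', 'Продажа', 'Клиент']
print(re.findall("[A-Z][^A-Z]*", text))   # [] because [A-Z] only covers ASCII
```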

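I think a better answer might be to split the string into words that do not end in a capital letter. This also handles the case where the string does not start with a capital letter.

re.findall('.[^A-Z]*', 'aboutTheLongAndWindingRoad')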

Example:

>>> import re
>>> re.findall('.[^A-Z]*', 'aboutTheLongAndWindingRoadABC')
['about', 'The', 'Long', 'And', 'Winding', 'Road', 'A', 'B', 'C']
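
For contrast, this short sketch runs both patterns on the same input. The variable names are only for illustration.

```python
import re

text = "aboutTheLongAndWindingRoadABC"

# Requiring an uppercase first character silently drops the leading 'about'.
upper_first = re.findall("[A-Z][^A-Z]*", text)

# Allowing any first character keeps it.
any_first = re.findall(".[^A-Z]*", text)

print(upper_first)  # starts with 'The'
print(any_first)    # starts with 'about'
```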

You could also do this:

def camelcase(s):
    words = []
    for char in s:
        # mark each uppercase letter with a ':' separator
        if char.isupper():
            words.append(':' + char)
        else:
            words.append(char)
    words = ''.join(words).split(':')
    return words

The output is as follows:

s = 'oneTwoThree'
print(camelcase(s))
# ['one', 'Two', 'Three']

def solution(s):
    st = ''
    for c in s:
        # note: c == c.upper() is also true for digits, spaces and punctuation
        if c == c.upper():
            st += ' '
        st += c
    return st

I used a list:

def split_by_upper(x):
    i = 0
    lis = list(x)
    while True:
        if i == len(lis)-1:
            if lis[i].isupper():
                lis.insert(i, ",")
            break
        if lis[i].isupper() and i != 0:
            lis.insert(i, ",")
            i += 1  # skip over the comma just inserted
        i += 1
    return "".join(lis).split(",")

Output:

data = "TheLongAndWindingRoad"
print(split_by_upper(data))
# ['The', 'Long', 'And', 'Winding', 'Road']

My solution for splitting at capital letters, which keeps all-caps words together:

import re

text = 'theLongAndWindingRoad ABC'
result = re.sub('(?<=.)(?=[A-Z][a-z])', r" ", text).split()
print(result)
# ['the', 'Long', 'And', 'Winding', 'Road', 'ABC']

A bit late to the party, but:

In [1]: camel = "CamelCaseConfig"
In [2]: parts = "".join([
   ...:     f"|{c}" if c.isupper() else c
   ...:     for c in camel
   ...: ]).lstrip("|").split("|")
In [3]: screaming_snake = "_".join([
   ...:     part.upper()
   ...:     for part in parts
   ...: ])
In [4]: screaming_snake
Out[4]: 'CAMEL_CASE_CONFIG'
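
If the end goal is the SCREAMING_SNAKE_CASE string rather than the list of parts, the conversion can also be sketched as a single substitution. This swaps in a regex for the join/split steps above; the function name `to_screaming_snake` is hypothetical.

```python
import re

def to_screaming_snake(name):
    # Insert "_" at every position that precedes an uppercase letter,
    # except at the very start of the string, then uppercase everything.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).upper()

print(to_screaming_snake("CamelCaseConfig"))  # CAMEL_CASE_CONFIG
```

Like the version above, this puts an underscore between every pair of adjacent capitals, so 'ABC' becomes 'A_B_C'.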

Part of my answer is based on other people's answers here.