Is there a generator version of `string.split()` in Python?

小开

No, but it should be easy enough to write one using itertools.takewhile().

EDIT:

Very simple, half-broken implementation:

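import itertools
import string


def isplitwords(s):
    i = iter(s)
    while True:
        # Collect characters up to (and consuming) the next whitespace character.
        r = list(itertools.takewhile(lambda x: x not in string.whitespace, i))
        if r:
            yield ''.join(r)
        else:
            # An empty run means either the input is exhausted or two whitespace
            # characters were adjacent; this version stops in both cases.
            # Use return here: raising StopIteration inside a generator is a
            # RuntimeError since Python 3.7.
            return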
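
The "half-broken" part comes from takewhile() itself: it swallows the character that ends each word and cannot distinguish "hit whitespace" from "ran out of input", so two adjacent whitespace characters look like the end of the string. Below is a minimal sketch of one way around that, assuming we peek at the next character with next() and a sentinel. The name isplitwords_peek and the sentinel object are illustrative and not part of the answer above.

```python
import itertools
import string


def isplitwords_peek(s):
    """Yield whitespace-separated words from any iterable of characters."""
    it = iter(s)
    end = object()  # sentinel that marks exhaustion of the input
    while True:
        first = next(it, end)
        if first is end:
            return  # input exhausted
        if first in string.whitespace:
            continue  # skip runs of whitespace one character at a time
        # takewhile() collects the rest of the word and swallows the
        # whitespace character that terminates it.
        rest = itertools.takewhile(lambda c: c not in string.whitespace, it)
        yield first + "".join(rest)


print(list(isplitwords_peek("  Good   evening,\tworld! ")))
# ['Good', 'evening,', 'world!']
```

Because it only calls next() on an iterator, this variant should also accept a lazily produced stream of characters rather than a string.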

小开

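~~I don't think there's any significant benefit to a generator version of split(). The generator object would have to contain the whole string to iterate over, so you're not going to save any memory by having a generator.~~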

If you wanted to write one, though, it would be fairly easy:

import string


def gsplit(s, sep=string.whitespace):
    word = []

    for c in s:
        if c in sep:
            if word:
                yield "".join(word)
                word = []
        else:
            word.append(c)

    if word:
        yield "".join(word)

小开

This is a generator version of split() implemented via re.search() that does not have the problem of allocating too many substrings.

import re


def itersplit(s, sep=None):
    exp = re.compile(r'\s+' if sep is None else re.escape(sep))
    pos = 0
    while True:
        m = exp.search(s, pos)
        if not m:
            if pos < len(s) or sep is not None:
                yield s[pos:]
            break
        if pos < m.start() or sep is not None:
            yield s[pos:m.start()]
        pos = m.end()


sample1 = "Good evening, world!"
sample2 = " Good evening, world! "
sample3 = "brackets][all][][over][here"
sample4 = "][brackets][all][][over][here]["

assert list(itersplit(sample1)) == sample1.split()
assert list(itersplit(sample2)) == sample2.split()
assert list(itersplit(sample3, '][')) == sample3.split('][')
assert list(itersplit(sample4, '][')) == sample4.split('][')

EDIT: Corrected handling of surrounding whitespace when no separator is given.

小开

Best answer

It is highly probable that re.finditer uses fairly minimal memory overhead.

import re


def split_iter(string):
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

Demo:

>>> list( split_iter("A programmer's RegEx test.") )
['A', "programmer's", 'RegEx', 'test']

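EDIT: I have just confirmed that this takes constant memory in Python 3.2.1, assuming my testing methodology was correct. I created a very large string (1 GB or so), then iterated through the iterable with a for loop (NOT a list comprehension, which would have allocated extra memory). This did not result in a noticeable growth of memory (that is, if there was any growth, it was far less than the 1 GB string).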
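
A rough way to repeat that kind of check on a smaller scale is sketched below. It assumes the standard-library tracemalloc module is an acceptable proxy for memory growth, uses a string of a few hundred kilobytes rather than 1 GB, and matches on `\S+` instead of the letter-only pattern above; the helper names peak_bytes, lazy and eager are made up for this sketch.

```python
import re
import tracemalloc

text = "word " * 50_000  # small stand-in for the ~1 GB string


def peak_bytes(consume):
    """Return the peak number of bytes allocated while consume() runs."""
    tracemalloc.start()
    try:
        consume()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def lazy():
    # One match object is alive at a time.
    for m in re.finditer(r"\S+", text):
        m.group(0)


def eager():
    # str.split() materialises every piece in a list first.
    for w in text.split():
        pass


lazy_peak = peak_bytes(lazy)
eager_peak = peak_bytes(eager)
print(f"finditer peak: {lazy_peak} bytes, split peak: {eager_peak} bytes")
```

The exact numbers depend on the interpreter, but the peak for the finditer loop should stay far below the peak for the list built by split().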

More general version:

In reply to the comment "I fail to see the connection with str.split", here is a more general version:

def splitStr(string, sep=r"\s+"):
    # warning: does not yet work if sep is a lookahead like `(?=b)`
    if sep == '':
        return (c for c in string)
    else:
        return (_.group(1) for _ in re.finditer(f'(?:^|{sep})((?:(?!{sep}).)*)', string))


# alternatively, the regex branch written more verbosely as a generator
# function (the name splitStrVerbose is only for illustration):
def splitStrVerbose(string, sep=r"\s+"):
    regex = f'(?:^|{sep})((?:(?!{sep}).)*)'
    for match in re.finditer(regex, string):
        fragment = match.group(1)
        yield fragment

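The idea is that ((?!pat).)* "negates" a group by ensuring it greedily matches until the pattern would start to match (lookaheads do not consume the string in the regex finite-state machine). In pseudocode: repeatedly consume (begin-of-string xor {sep}) + as much as possible until we would be able to begin again (or hit the end of the string).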
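
To see the ((?!pat).)* idiom in isolation, here is a small sketch. The separator pattern and the sample strings are made up for illustration.

```python
import re

sep = r"\.\.\."  # separator pattern: three literal dots
tempered = re.compile(f"(?:(?!{sep}).)*")

# Starting at index 0, the pattern consumes one character at a time and
# stops right before the first position where the separator would match.
print(tempered.match("ab.c...d").group(0))  # ab.c

# Prefixing it with "(start of string or separator)" gives the fragments.
full = re.compile(f"(?:^|{sep})((?:(?!{sep}).)*)")
print([m.group(1) for m in full.finditer("a...b...")])  # ['a', 'b', '']
```

Note that `.` does not match a newline unless re.DOTALL is passed, so fragments would also stop at line breaks with this pattern.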

Demo:

>>> splitStr('.......A...b...c....', sep='...')
<generator object splitStr.<locals>.<genexpr> at 0x7fe8530fb5e8>


>>> list(splitStr('A,b,c.', sep=','))
['A', 'b', 'c.']


>>> list(splitStr(',,A,b,c.,', sep=','))
['', '', 'A', 'b', 'c.', '']


>>> list(splitStr('.......A...b...c....', '\.\.\.'))
['', '', '.A', 'b', 'c', '.']


>>> list(splitStr('   A  b  c. '))
['', 'A', 'b', 'c.', '']

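(One should note that str.split has an ugly behavior: it special-cases sep=None by first doing the equivalent of str.strip to remove leading and trailing whitespace. The above purposefully does not do that; see the last example, where sep="\s+".)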
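
The difference is easy to see side by side. The snippet below is only an illustration of the built-in behavior, using the same sample string as the demo above:

```python
import re

s = "   A  b  c. "

# sep=None: runs of whitespace are one separator, and the ends are stripped.
print(s.split())            # ['A', 'b', 'c.']

# An explicit separator keeps every empty piece.
print(s.split(" "))         # ['', '', '', 'A', '', 'b', '', 'c.', '']

# A regex on \s+ collapses runs but still reports the empty ends.
print(re.split(r"\s+", s))  # ['', 'A', 'b', 'c.', '']
```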

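(I ran into various bugs (including an internal re.error) when trying to implement this... Negative lookbehind would restrict you to fixed-length delimiters, so we don't use that. Almost anything besides the above regex seemed to result in errors with the beginning-of-string and end-of-string edge cases (e.g. r'(.*?)($|,)' on ',,,a,,b,c' returns ['', '', '', 'a', '', 'b', 'c', ''] with an extraneous empty string at the end; one can look at the edit history for another seemingly correct regex that actually has subtle bugs.)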

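(If you want to implement this yourself for higher performance (although they are heavyweight, regexes most importantly run in C), you'd write some code (with ctypes? not sure how to get generators working with it?) using the following pseudocode for fixed-length delimiters: hash your delimiter of length L. Keep a running hash of length L as you scan the string using a running hash algorithm, with O(1) update time. Whenever the hash might equal your delimiter, manually check whether the past few characters were the delimiter; if so, yield the substring since the last yield. Special-case the beginning and end of the string. This would be a generator version of the textbook algorithm for O(N) text search. Multiprocessing versions are also possible. They might seem overkill, but the question implies that one is working with really huge strings... At that point you might consider crazy things like caching byte offsets if there are few of them, or working from disk with some disk-backed bytestring view object, buying more RAM, etc.)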
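
The running-hash pseudocode can be sketched in pure Python as follows. This is only an illustration of the idea (a Rabin-Karp style polynomial hash), not a performance improvement over str.find or re; the function name rolling_hash_split and the base and mod constants are arbitrary choices made for this sketch.

```python
def rolling_hash_split(text, sep, base=257, mod=(1 << 61) - 1):
    """Lazily split text on a fixed, non-empty separator using a rolling hash."""
    sep_len = len(sep)
    if sep_len == 0:
        raise ValueError("empty separator")

    # Hash of the separator itself.
    sep_hash = 0
    for ch in sep:
        sep_hash = (sep_hash * base + ord(ch)) % mod

    high = pow(base, sep_len - 1, mod)  # weight of the character leaving the window
    window_hash = 0
    start = 0       # start of the fragment currently being collected
    skip_until = 0  # windows starting before this index overlap a used separator

    for i, ch in enumerate(text):
        if i >= sep_len:
            # O(1) update: remove the character that slides out of the window.
            window_hash = (window_hash - ord(text[i - sep_len]) * high) % mod
        window_hash = (window_hash * base + ord(ch)) % mod

        w = i - sep_len + 1  # index where the current window starts
        # A hash hit is only a candidate; confirm with a direct comparison.
        if w >= skip_until and window_hash == sep_hash and text[w:i + 1] == sep:
            yield text[start:w]
            start = skip_until = i + 1

    yield text[start:]  # trailing fragment (may be empty)


print(list(rolling_hash_split("a--b----c", "--")))  # ['a', 'b', '', 'c']
```

The explicit comparison after a hash hit guards against collisions, and skip_until keeps overlapping occurrences from being counted twice, which mirrors how str.split scans left to right.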

小开

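The most efficient way I can think of is to write one using the offset parameter of the str.find() method. This avoids heavy memory use, and avoids the overhead of a regexp when one isn't needed.

[edit 2016-8-2: updated this to optionally support regex separators]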

import re


def isplit(source, sep=None, regex=False):
    """
    generator version of str.split()

    :param source:
        source string (unicode or bytes)

    :param sep:
        separator to split on.

    :param regex:
        if True, will treat sep as regular expression.

    :returns:
        generator yielding elements of string.
    """
    if sep is None:
        # mimic default python behavior
        source = source.strip()
        sep = "\\s+"
        if isinstance(source, bytes):
            sep = sep.encode("ascii")
        regex = True
    if regex:
        # version using re.finditer()
        if not hasattr(sep, "finditer"):
            sep = re.compile(sep)
        start = 0
        for m in sep.finditer(source):
            idx = m.start()
            assert idx >= start
            yield source[start:idx]
            start = m.end()
        yield source[start:]
    else:
        # version using str.find(), less overhead than re.finditer()
        sepsize = len(sep)
        start = 0
        while True:
            idx = source.find(sep, start)
            if idx == -1:
                yield source[start:]
                return
            yield source[start:idx]
            start = idx + sepsize

This can be used like you want...

>>> print(list(isplit("abcb", "b")))
['a', 'c', '']

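While there is a little bit of cost seeking within the string each time find() or slicing is performed, this should be minimal since strings are represented as contiguous arrays in memory.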
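
One practical consequence of the find()-with-offset approach is that a consumer who stops early never pays for the rest of the string. The sketch below uses a stripped-down splitter (find_split is a hypothetical name, and a non-empty separator is assumed) together with itertools.islice:

```python
from itertools import islice


def find_split(source, sep):
    """Minimal str.find()-based splitter; assumes sep is non-empty."""
    start = 0
    while True:
        idx = source.find(sep, start)
        if idx == -1:
            yield source[start:]
            return
        yield source[start:idx]
        start = idx + len(sep)


big = ",".join(str(n) for n in range(1_000_000))

# Only the first three fields are located and sliced; the remainder of the
# string is never scanned.
first_three = list(islice(find_split(big, ","), 3))
print(first_three)  # ['0', '1', '2']
```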

小开

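Here is my implementation, which is much, much faster and more complete than the other answers here. It has 4 separate subfunctions for different cases.

I'll just copy the docstring of the main str_split function: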

str_split(s, *delims, empty=None)

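Split the string s by the rest of the arguments, possibly omitting empty parts (the empty keyword argument is responsible for that). This is a generator function.

When only one delimiter is supplied, the string is simply split by it. empty is then True by default.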

str_split('[]aaa[][]bb[c', '[]')
-> '', 'aaa', '', 'bb[c'
str_split('[]aaa[][]bb[c', '[]', empty=False)
-> 'aaa', 'bb[c'

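When multiple delimiters are supplied, the string is split by the longest possible sequences of those delimiters by default, or, if empty is set to True, empty strings between the delimiters are also included. Note that the delimiters in this case may only be single characters.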

str_split('aaa, bb : c;', ' ', ',', ':', ';')
-> 'aaa', 'bb', 'c'
str_split('aaa, bb : c;', *' ,:;', empty=True)
-> 'aaa', '', 'bb', '', '', 'c', ''

When no delimiters are supplied, string.whitespace is used, so the effect is the same as str.split(), except this function is a generator.

str_split('aaa\\t  bb c \\n')
-> 'aaa', 'bb', 'c'

import string


def _str_split_chars(s, delims):
    """Split the string `s` by characters contained in `delims`, including the
    empty parts between two consecutive delimiters"""
    start = 0
    for i, c in enumerate(s):
        if c in delims:
            yield s[start:i]
            start = i+1
    yield s[start:]


def _str_split_chars_ne(s, delims):
    """Split the string `s` by longest possible sequences of characters
    contained in `delims`"""
    start = 0
    in_s = False
    for i, c in enumerate(s):
        if c in delims:
            if in_s:
                yield s[start:i]
                in_s = False
        else:
            if not in_s:
                in_s = True
                start = i
    if in_s:
        yield s[start:]




def _str_split_word(s, delim):
    "Split the string `s` by the string `delim`"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    yield s[start:]


def _str_split_word_ne(s, delim):
    """Split the string `s` by the string `delim`, not including empty parts
    between two consecutive delimiters"""
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            if start != i:
                yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    if start < len(s):
        yield s[start:]




def str_split(s, *delims, empty=None):
    """\
    Split the string `s` by the rest of the arguments, possibly omitting
    empty parts (`empty` keyword argument is responsible for that).
    This is a generator function.

    When only one delimiter is supplied, the string is simply split by it.
    `empty` is then `True` by default.
        str_split('[]aaa[][]bb[c', '[]')
            -> '', 'aaa', '', 'bb[c'
        str_split('[]aaa[][]bb[c', '[]', empty=False)
            -> 'aaa', 'bb[c'

    When multiple delimiters are supplied, the string is split by longest
    possible sequences of those delimiters by default, or, if `empty` is set to
    `True`, empty strings between the delimiters are also included. Note that
    the delimiters in this case may only be single characters.
        str_split('aaa, bb : c;', ' ', ',', ':', ';')
            -> 'aaa', 'bb', 'c'
        str_split('aaa, bb : c;', *' ,:;', empty=True)
            -> 'aaa', '', 'bb', '', '', 'c', ''

    When no delimiters are supplied, `string.whitespace` is used, so the effect
    is the same as `str.split()`, except this function is a generator.
        str_split('aaa\\t  bb c \\n')
            -> 'aaa', 'bb', 'c'
    """
    if len(delims) == 1:
        f = _str_split_word if empty is None or empty else _str_split_word_ne
        return f(s, delims[0])
    if len(delims) == 0:
        delims = string.whitespace
    delims = set(delims) if len(delims) >= 4 else ''.join(delims)
    if any(len(d) > 1 for d in delims):
        raise ValueError("Only 1-character multiple delimiters are supported")
    f = _str_split_chars if empty else _str_split_chars_ne
    return f(s, delims)

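This function works in Python 3, and an easy, though quite ugly, fix can be applied to make it work in both Python 2 and 3. The first lines of the function should be changed to: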

def str_split(s, *delims, **kwargs):
    """...docstring..."""
    empty = kwargs.get('empty')

小开

def split_generator(f, s):
    """
    f is a string, s is the single-character substring we split on.
    This produces a generator rather than a possibly
    memory intensive list.
    """
    i = 0
    j = 0
    while j < len(f):
        if i >= len(f):
            yield f[j:]
            j = i
        elif f[i] != s:
            i = i + 1
        else:
            # yield the piece itself, not a one-element list
            yield f[j:i]
            j = i + 1
            i = i + 1

小开

I wrote a version of @ninjagecko's answer that behaves more like string.split (i.e. whitespace-delimited by default, and you can specify a delimiter).

import re


def isplit(string, delimiter=None):
    """Like string.split but returns an iterator (lazy)

    Multiple character delimiters are not handled.
    """

    if delimiter is None:
        # Whitespace delimited by default
        delim = r"\s"

    elif len(delimiter) != 1:
        raise ValueError("Can only handle single character delimiters",
                         delimiter)

    else:
        # Escape, in case it's "\", "*" etc.
        delim = re.escape(delimiter)

    return (x.group(0) for x in re.finditer(r"[^{}]+".format(delim), string))

Here are the tests I used (in both Python 3 and Python 2):

# Wrapper to make it a list
def helper(*args, **kwargs):
    return list(isplit(*args, **kwargs))


# Normal delimiters
assert helper("1,2,3", ",") == ["1", "2", "3"]
assert helper("1;2;3,", ";") == ["1", "2", "3,"]
assert helper("1;2 ;3,  ", ";") == ["1", "2 ", "3,  "]

# Whitespace
assert helper("1 2 3") == ["1", "2", "3"]
assert helper("1\t2\t3") == ["1", "2", "3"]
assert helper("1\t2 \t3") == ["1", "2", "3"]
assert helper("1\n2\n3") == ["1", "2", "3"]

# Surrounding whitespace dropped
assert helper(" 1 2  3  ") == ["1", "2", "3"]

# Regex special characters
assert helper(r"1\2\3", "\\") == ["1", "2", "3"]
assert helper(r"1*2*3", "*") == ["1", "2", "3"]

# No multi-char delimiters allowed
try:
    helper(r"1,.2,.3", ",.")
    assert False
except ValueError:
    pass

Python's regex module says that it does "the right thing" for unicode whitespace, but I haven't actually tested it.
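
A quick check of that claim might look like the sketch below. It assumes Python 3 str patterns, where \s is Unicode-aware by default, and uses three non-ASCII space characters picked for illustration:

```python
import re

# NO-BREAK SPACE, EM SPACE and IDEOGRAPHIC SPACE between the digits
text = "1\u00a02\u20033\u30004"

print(re.findall(r"[^\s]+", text))  # ['1', '2', '3', '4']
print(text.split())                 # ['1', '2', '3', '4']

# With re.ASCII, \s only covers ASCII whitespace, so nothing is split.
print(len(re.findall(r"[^\s]+", text, flags=re.ASCII)))  # 1
```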

Also available as a gist.

小开

If you would also like to be able to read from an iterator (as well as return one), try this:

import itertools as it


def iter_split(string, sep=None):
    sep = sep or ' '
    groups = it.groupby(string, lambda s: s != sep)
    return (''.join(g) for k, g in groups if k)

Usage

>>> list(iter_split(iter("Good evening, world!")))
['Good', 'evening,', 'world!']
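
Since the input only has to be an iterable of characters, the same groupby idea can be pointed at a stream that is read in chunks. The following is a sketch under that assumption; chars_from and iter_split_chars are hypothetical helper names, and io.StringIO stands in for a real file.

```python
import io
import itertools as it


def chars_from(stream, chunk_size=4096):
    """Yield characters from a text stream, reading chunk_size at a time."""
    chunks = iter(lambda: stream.read(chunk_size), "")  # stops at end of stream
    return it.chain.from_iterable(chunks)


def iter_split_chars(chars, sep=" "):
    """groupby-based splitter over any iterable of single characters."""
    groups = it.groupby(chars, lambda c: c != sep)
    return ("".join(g) for k, g in groups if k)


stream = io.StringIO("Good evening, world!")
print(list(iter_split_chars(chars_from(stream, chunk_size=5))))
# ['Good', 'evening,', 'world!']
```

Words that straddle a chunk boundary are still joined correctly, because groupby only ever sees the flattened character stream.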

小开

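I did some performance testing on the various methods proposed (I won't repeat them here):

str.split (default) = 0.3461570239996945
manual search (by character) (one of Dave Webb's answers) = 0.8260340550004912
re.finditer (ninjagecko's answer) = 0.698872097000276
str.find (one of Eli Collins's answers) = 0.7230395330007013
itertools.takewhile (Ignacio Vazquez-Abrams's answer) = 2.023023967998597
recursion = N/A†

† The recursion answers (string.split with maxsplit = 1) fail to complete in a reasonable time. Given string.split's speed they may work better on shorter strings, but then I can't see the use case for short strings, where memory isn't an issue anyway.

Tested using timeit on: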

the_text = "100 " * 9999 + "100"


def test_function(method):
    def fn():
        total = 0

        for x in method(the_text):
            total += int(x)

        return total

    return fn

This raises another question: why is string.split so much faster despite its memory usage?

小开

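I wanted to show how to use the re.finditer solution to return a generator of matches for the given delimiters, and then use the pairwise recipe from itertools to iterate over (previous, current) match pairs, which gives the actual words just as the original split method would.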

from more_itertools import pairwise
import re


string = "dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d"
delimiter = " "
# split according to the given delimiter including segments beginning at the beginning and ending at the end
for prev, curr in pairwise(re.finditer("^|[{0}]+|$".format(delimiter), string)):
    print(string[prev.end(): curr.start()])

Notes:

I use prev & curr instead of prev & next because overriding next in Python is a very bad idea.
This is quite efficient.

小开

more_itertools.split_at offers an analog to str.split for iterators.

>>> import more_itertools as mit




>>> list(mit.split_at("abcdcba", lambda x: x == "b"))
[['a'], ['c', 'd', 'c'], ['a']]


>>> "abcdcba".split("b")
['a', 'cdc', 'a']

more_itertools is a third-party package.

小开

Here is a simple answer:

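def gen_str(some_string, sep):
    # sep is expected to be a single character
    j = 0
    for i, s in enumerate(some_string):
        if s == sep:
            yield some_string[j:i]
            j = i + 1
    # Yield the final piece after the loop, so a trailing separator produces
    # an empty string, as str.split(sep) does.
    yield some_string[j:]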

小开

The dumbest method, without regex / itertools:

def isplit(text, split='\n'):
    while text != '':
        end = text.find(split)

        if end == -1:
            yield text
            text = ''
        else:
            yield text[:end]
            # skip the whole separator, not just one character
            text = text[end + len(split):]

小开

This is an old question, but here is my humble contribution with an efficient algorithm:

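from typing import Iterable


def str_split(text: str, separator: str) -> Iterable[str]:
    i = 0
    n = len(text)
    while i <= n:
        j = text.find(separator, i)
        if j == -1:
            j = n
        yield text[i:j]
        # advance past the whole separator so multi-character separators work
        i = j + len(separator)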

小开

import re


def isplit(text, sep=None, maxsplit=-1):
    if not isinstance(text, (str, bytes)):
        raise TypeError(f"requires 'str' or 'bytes' but received a '{type(text).__name__}'")
    if sep in ('', b''):
        raise ValueError('empty separator')

    if maxsplit == 0 or not text:
        yield text
        return

    regex = (
        re.escape(sep) if sep is not None
        else [br'\s+', r'\s+'][isinstance(text, str)]
    )
    yield from re.split(regex, text, maxsplit=max(0, maxsplit))
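
One thing to keep in mind with a re.split()-based generator is that re.split() returns a complete list, so every piece is built before the first one is yielded. If laziness matters, a maxsplit-aware variant can be sketched on top of re.finditer() instead. The version below handles only an explicit str separator, and the name isplit_lazy is made up for this sketch.

```python
import re


def isplit_lazy(text, sep, maxsplit=-1):
    """Lazily split text on a literal separator, honouring maxsplit."""
    if sep == "":
        raise ValueError("empty separator")
    pattern = re.compile(re.escape(sep))
    start = 0
    for count, m in enumerate(pattern.finditer(text)):
        if 0 <= maxsplit <= count:
            break  # enough splits done; the rest is yielded as one piece
        yield text[start:m.start()]
        start = m.end()
    yield text[start:]


print(list(isplit_lazy("a,b,c", ",", maxsplit=1)))  # ['a', 'b,c']
```

As with str.split, a negative maxsplit means no limit and maxsplit=0 yields the whole string unchanged.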

小开

Here is an answer that is based on split and maxsplit, and doesn't use recursion.

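def gsplit(todo):
    chunk = 100
    while todo:
        splits = todo.split(maxsplit=chunk)
        # split(maxsplit=chunk) returns chunk + 1 items when text remains;
        # the last item is then the unsplit remainder.
        if len(splits) > chunk:
            todo = splits.pop()
        else:
            todo = None
        for item in splits:
            yield item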