When is it not a good time to use Python generators?

This is rather the inverse of What can you use Python generator functions for?: Python generators, generator expressions, and the itertools module are some of my favorite features of Python these days. They're especially useful when setting up chains of operations to perform on a big pile of data; I often use them when processing DSV files.

So, when is it not a good time to use a generator, a generator expression, or an itertools function?

  • When should I favor zip() over itertools.izip(), or
  • range() over xrange(), or
  • [x for x in foo] over (x for x in foo)?

Obviously, we eventually need to "resolve" a generator into actual data, usually by creating a list or iterating over it with a non-generator loop. Sometimes we just need to know the length. This isn't what I'm asking about.

We use generators so that we aren't allocating new lists into memory for interim data. This especially makes sense for large datasets. Does it make sense for small datasets too? Is there a noticeable memory/CPU trade-off?

I'm especially interested if anyone has done some profiling on this, in light of the eye-opening discussion of list comprehension performance vs. map() and filter(). (alternate link)


You should never favor zip over izip, range over xrange, or list comprehensions over generator comprehensions. In Python 3.0 range has xrange-like semantics and zip has izip-like semantics.

For those times you need an actual list, spelling it out as list(frob(x) for x in foo) can actually be clearer than a list comprehension.

In general, don't use a generator when you need list operations, like len(), reversed(), and so on.
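A minimal sketch of why those list operations rule generators out (Python 3 shown here, where range() is itself lazy):

```python
gen = (x * x for x in range(5))

# Generators have no length and can't be walked backwards;
# len() raises a TypeError on them.
try:
    len(gen)
except TypeError:
    print("len() doesn't work on a generator")

# Materialize a list first when you need those operations.
squares = list(x * x for x in range(5))
print(len(squares))             # 5
print(list(reversed(squares)))  # [16, 9, 4, 1, 0]
```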

There may also be times when you don't want lazy evaluation (e.g. to do all the calculation up front so you can release a resource). In that case, a list expression might be better.
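A sketch of that resource-release point, using two hypothetical read_rows_* helpers (Python 3):

```python
def read_rows_lazy(path):
    # Hypothetical helper for illustration. Lazy: the file stays
    # open until the generator is fully exhausted (or garbage
    # collected), which may be much later than you expect.
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

def read_rows_eager(path):
    # Eager: all lines are read up front, so the file handle is
    # released as soon as this function returns.
    with open(path) as f:
        return [line.rstrip("\n") for line in f]
```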

As you mention, "this especially makes sense for large datasets"; I think that answers your question.

If you're not hitting any walls performance-wise, you can stick to lists and standard functions. Then, when you run into performance problems, make the switch.

As mentioned by @u0b34a0f6ae in the comments, however, using generators at the start can make it easier for you to scale to larger datasets.

Profile, Profile, Profile.

Profiling your code is the only way to know if what you're doing has any effect at all.

Most usages of xrange, generators, etc. are over small, statically sized datasets. It's only when you get to large datasets that it really makes a difference. range() vs. xrange() is mostly just a matter of making the code look a tiny bit uglier, while losing nothing and maybe gaining something.

Profile, Profile, Profile.

As far as performance is concerned, I can't think of any times that you would want to use a list over a generator.

I've never found a situation where generators would hinder what you're trying to do. There are, however, plenty of instances where using generators would not help you any more than not using them.

For example:

sorted(xrange(5))

Does not offer any improvement over:

sorted(range(5))

Regarding performance: if you are using psyco, lists can be quite a bit faster than generators. In the example below, lists are almost 50% faster when using psyco.full().

import psyco
import time
import cStringIO


def time_func(func):
    """The amount of time it requires func to run"""
    start = time.clock()
    func()
    return time.clock() - start


def fizzbuzz(num):
    """That algorithm we all know and love"""
    if not num % 3 and not num % 5:
        return "%d fizz buzz" % num
    elif not num % 3:
        return "%d fizz" % num
    elif not num % 5:
        return "%d buzz" % num
    return None


def with_list(num):
    """Try getting fizzbuzz with a list comprehension and range"""
    out = cStringIO.StringIO()
    for fibby in [fizzbuzz(x) for x in range(1, num) if fizzbuzz(x)]:
        print >> out, fibby
    return out.getvalue()


def with_genx(num):
    """Try getting fizzbuzz with a generator expression and xrange"""
    out = cStringIO.StringIO()
    for fibby in (fizzbuzz(x) for x in xrange(1, num) if fizzbuzz(x)):
        print >> out, fibby
    return out.getvalue()


def main():
    """
    Test speed of generator expressions versus list comprehensions,
    with and without psyco.
    """

    # our variables
    nums = [10000, 100000]
    funcs = [with_list, with_genx]

    # try without psyco 1st
    print "without psyco"
    for num in nums:
        print "  number:", num
        for func in funcs:
            print func.__name__, time_func(lambda: func(num)), "seconds"
        print

    # now with psyco
    print "with psyco"
    psyco.full()
    for num in nums:
        print "  number:", num
        for func in funcs:
            print func.__name__, time_func(lambda: func(num)), "seconds"
        print


if __name__ == "__main__":
    main()

Results:

without psyco
  number: 10000
with_list 0.0519102208309 seconds
with_genx 0.0535933367509 seconds

  number: 100000
with_list 0.542204280744 seconds
with_genx 0.557837353115 seconds

with psyco
  number: 10000
with_list 0.0286369007033 seconds
with_genx 0.0513424889137 seconds

  number: 100000
with_list 0.335414877839 seconds
with_genx 0.580363490491 seconds

You should prefer list comprehensions if you need to keep the values around for something else later and the size of your set is not too large.

For example: you are creating a list that you will loop over several times later in your program.

To some extent you can think of generators as a replacement for iteration (loops) vs. list comprehensions as a type of data structure initialization. If you want to keep the data structure then use list comprehensions.

Use a list instead of a generator when:

1) You need to access the data multiple times (i.e. cache the results instead of recomputing them):

for i in outer:           # used once, okay to be a generator or return a list
    for j in inner:       # used multiple times, reusing a list is better
        ...
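
The single-pass behavior behind point 1 can be sketched like this (Python 3):

```python
inner_gen = (x * 2 for x in range(3))
inner_list = [x * 2 for x in range(3)]

# A generator is exhausted after one pass...
print(sum(inner_gen))   # 6
print(sum(inner_gen))   # 0 -- nothing left to iterate

# ...while a list can be reused as many times as needed.
print(sum(inner_list))  # 6
print(sum(inner_list))  # 6
```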

2) You need random access (or any access other than forward sequential order):

for i in reversed(data): ...     # generators aren't reversible


s[i], s[j] = s[j], s[i]          # generators aren't indexable

3) You need to join strings (which requires two passes over the data):

s = ''.join(data)                # lists are faster than generators in this use case
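
For instance (Python 3; as far as I know, CPython's str.join converts a non-list argument to a list internally before sizing the result, so a generator argument just adds overhead):

```python
words = ["fizz", "buzz", "fizz buzz"]

# Passing the list directly lets join compute the result size in one pass.
s1 = " ".join(words)

# A generator argument gets materialized into a list internally first,
# so this does strictly more work for the same result.
s2 = " ".join(w for w in words)

print(s1 == s2)  # True
```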

4) You are using PyPy, which sometimes can't optimize generator code as well as it can ordinary function calls and list manipulations.

A generator produces a sequence of values lazily, which is useful when an iterative process can consume the values on demand. Producing the values still takes time, though, so if the dataset is millions of records in size, it may be more efficient to let SQL Server process the data in SQL instead.
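A minimal sketch of "values on demand", using a hypothetical records() generator (Python 3):

```python
def records():
    # Hypothetical stand-in for an expensive per-record computation.
    for i in range(1, 1_000_001):
        yield i * i

# The consumer can stop early; records it never asks for
# are never computed.
first_three = []
for value in records():
    first_three.append(value)
    if len(first_three) == 3:
        break

print(first_three)  # [1, 4, 9]
```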