How to limit concurrency with Python asyncio?

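Let's assume we have a bunch of links to download, and each link may take a different amount of time to download. I'm allowed to use at most 3 connections. I want to make sure I do this efficiently using asyncio.

Here's what I'm trying to achieve: at any point in time, try to ensure that I have 3 downloads running.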

Connection 1: 1---------7---9---
Connection 2: 2---4----6-----
Connection 3: 3-----5---8-----

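The numbers represent the download links, while the hyphens represent waiting for the download.

Here is the code I'm using right now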

from random import randint
import asyncio


count = 0


async def download(code, permit_download, no_concurrent, downloading_event):
    global count
    downloading_event.set()
    wait_time = randint(1, 3)
    print('downloading {} will take {} second(s)'.format(code, wait_time))
    await asyncio.sleep(wait_time)  # I/O, context will switch to main function
    print('downloaded {}'.format(code))
    count -= 1
    if count < no_concurrent and not permit_download.is_set():
        permit_download.set()


async def main(loop):
    global count
    permit_download = asyncio.Event()
    permit_download.set()
    downloading_event = asyncio.Event()
    no_concurrent = 3
    i = 0
    while i < 9:
        if permit_download.is_set():
            count += 1
            if count >= no_concurrent:
                permit_download.clear()
            loop.create_task(download(i, permit_download, no_concurrent, downloading_event))
            await downloading_event.wait()  # To force context to switch to download function
            downloading_event.clear()
            i += 1
        else:
            await permit_download.wait()
    await asyncio.sleep(9)


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    try:
        loop.run_until_complete(main(loop))
    finally:
        loop.close()

And the output is as expected:

downloading 0 will take 2 second(s)
downloading 1 will take 3 second(s)
downloading 2 will take 1 second(s)
downloaded 2
downloading 3 will take 2 second(s)
downloaded 0
downloading 4 will take 3 second(s)
downloaded 1
downloaded 3
downloading 5 will take 2 second(s)
downloading 6 will take 2 second(s)
downloaded 5
downloaded 6
downloaded 4
downloading 7 will take 1 second(s)
downloading 8 will take 1 second(s)
downloaded 7
downloaded 8

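But here are my questions:

  1. At the moment, I'm simply waiting for 9 seconds to keep the main function running until the downloads are complete. Is there an efficient way of waiting for the last download to complete before exiting the main function? (I know there's asyncio.wait, but I would need to store all the task references for it to work.)

  2. What's a good library that does this kind of task? I know JavaScript has a lot of async libraries, but what about Python?

Edit: 2. What's a good library that takes care of common async patterns? (Something like async)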


Before reading the rest of this answer, please note that the idiomatic way of limiting the number of parallel tasks with asyncio is to use asyncio.Semaphore, as shown in Mikhail's answer and elegantly abstracted in Andrei's answer. This answer contains working, but somewhat more complicated, ways of achieving the same thing. I am leaving the answer up because in some cases this approach can have advantages over a semaphore, specifically when the work to be done is very large or unbounded and you cannot create all the coroutines in advance. In that case the second (queue-based) solution in this answer is what you want. But in most regular situations, such as parallel download through aiohttp, you should use a semaphore instead.


You basically need a fixed-size pool of download tasks. asyncio doesn't come with a pre-made task pool, but it is easy to create one: simply keep a set of tasks and don't allow it to grow past the limit. Although the question states your reluctance to go down that route, the code ends up much more elegant:

import asyncio, random


async def download(code):
    wait_time = random.randint(1, 3)
    print('downloading {} will take {} second(s)'.format(code, wait_time))
    await asyncio.sleep(wait_time)  # I/O, context will switch to main function
    print('downloaded {}'.format(code))


async def main(loop):
    no_concurrent = 3
    dltasks = set()
    i = 0
    while i < 9:
        if len(dltasks) >= no_concurrent:
            # Wait for some download to finish before adding a new one
            _done, dltasks = await asyncio.wait(
                dltasks, return_when=asyncio.FIRST_COMPLETED)
        dltasks.add(loop.create_task(download(i)))
        i += 1
    # Wait for the remaining downloads to finish
    await asyncio.wait(dltasks)

An alternative is to create a fixed number of coroutines doing the downloading, much like a fixed-size thread pool, and feed them work using an asyncio.Queue. This removes the need to manually limit the number of downloads, which will be automatically limited by the number of coroutines invoking download():

# download() defined as above


async def download_worker(q):
    while True:
        code = await q.get()
        await download(code)
        q.task_done()


async def main(loop):
    q = asyncio.Queue()
    workers = [loop.create_task(download_worker(q)) for _ in range(3)]
    i = 0
    while i < 9:
        await q.put(i)
        i += 1
    await q.join()  # wait for all tasks to be processed
    for worker in workers:
        worker.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
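
For the "very large or unbounded" case, the queue can also be given a maximum size so that the producer is paused while the workers are busy. The following is only a sketch under assumptions: process(), worker() and run_bounded() are hypothetical names, and the short sleep stands in for the real download.

```python
import asyncio


async def process(item, results):
    # Stand-in for the real download; records what it handled.
    await asyncio.sleep(0.01)
    results.append(item)


async def worker(queue, results):
    while True:
        item = await queue.get()
        try:
            await process(item, results)
        except Exception as exc:  # keep the worker alive if one item fails
            print(f"item {item!r} failed: {exc!r}")
        finally:
            queue.task_done()  # always mark the item done so join() can return


async def run_bounded(items, n_workers=3):
    results = []
    # maxsize makes put() wait while the queue is full, so a huge or
    # endless iterable is consumed only as fast as it is processed.
    queue = asyncio.Queue(maxsize=n_workers)
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(n_workers)]
    for item in items:
        await queue.put(item)
    await queue.join()  # wait until every queued item was marked done
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


if __name__ == "__main__":
    print(sorted(asyncio.run(run_bounded(iter(range(9))))))
```

Calling task_done() in a finally block is a design choice here: without it, one failing item would leave q.join() waiting forever.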

As for your other question, the obvious choice would be aiohttp.

If I'm not mistaken you're searching for asyncio.Semaphore. Example of usage:

import asyncio
from random import randint


async def download(code):
    wait_time = randint(1, 3)
    print('downloading {} will take {} second(s)'.format(code, wait_time))
    await asyncio.sleep(wait_time)  # I/O, context will switch to main function
    print('downloaded {}'.format(code))


sem = asyncio.Semaphore(3)


async def safe_download(i):
    async with sem:  # semaphore limits num of simultaneous downloads
        return await download(i)


async def main():
    tasks = [
        asyncio.ensure_future(safe_download(i))  # creating task starts coroutine
        for i
        in range(9)
    ]
    await asyncio.gather(*tasks)  # await moment all downloads done


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    try:
        loop.run_until_complete(main())
    finally:
        loop.run_until_complete(loop.shutdown_asyncgens())
        loop.close()

Output:

downloading 0 will take 3 second(s)
downloading 1 will take 3 second(s)
downloading 2 will take 1 second(s)
downloaded 2
downloading 3 will take 3 second(s)
downloaded 1
downloaded 0
downloading 4 will take 2 second(s)
downloading 5 will take 1 second(s)
downloaded 5
downloaded 3
downloading 6 will take 3 second(s)
downloading 7 will take 1 second(s)
downloaded 4
downloading 8 will take 2 second(s)
downloaded 7
downloaded 8
downloaded 6

An example of async downloading with aiohttp can be found here. Note that aiohttp has a Semaphore equivalent built in, which you can see an example of here. It has a default limit of 100 connections.
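
The semaphore example above creates the semaphore at module level and drives the loop by hand. On Python 3.7+ the same idea can probably be written with asyncio.run and a semaphore created inside main(), which avoids tying it to a loop that is not the one running. This is a sketch, not the answer's original code: the stats dictionary and the parameter names are additions used only to observe the concurrency, and the short sleep stands in for real I/O.

```python
import asyncio
import random


async def download(code, sem, stats):
    async with sem:  # at most `limit` coroutines are inside this block at once
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(random.uniform(0.01, 0.03))  # stand-in for real I/O
        stats["active"] -= 1
        return code


async def main(n_downloads=9, limit=3):
    sem = asyncio.Semaphore(limit)  # created while the loop is running
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(download(i, sem, stats) for i in range(n_downloads))
    )
    return results, stats["peak"]


if __name__ == "__main__":
    results, peak = asyncio.run(main())
    print(results, "peak concurrency:", peak)
```

asyncio.gather returns the results in the order the coroutines were passed in, regardless of which download finishes first.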

The asyncio-pool library does exactly what you need.

https://pypi.org/project/asyncio-pool/


from asyncio_pool import AioPool

LIST_OF_URLS = ("http://www.google.com", "......")


async def main():
    pool = AioPool(size=3)
    await pool.map(your_download_coroutine, LIST_OF_URLS)

Small Update: It's no longer necessary to create the loop. I tweaked the code below. Just cleans things up slightly.

# download(code) is the same


async def main():
    no_concurrent = 3
    dltasks = set()
    for i in range(9):
        if len(dltasks) >= no_concurrent:
            # Wait for some download to finish before adding a new one
            _done, dltasks = await asyncio.wait(dltasks, return_when=asyncio.FIRST_COMPLETED)
        dltasks.add(asyncio.create_task(download(i)))
    # Wait for the remaining downloads to finish
    await asyncio.wait(dltasks)


if __name__ == '__main__':
    asyncio.run(main())
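
One thing the task-set approach above leaves out is that the finished tasks in _done are discarded, so return values and exceptions are never looked at. A possible variation that collects them is sketched below; run_limited() and fetch() are hypothetical names, and the work is passed in as zero-argument callables so that each coroutine is only created when a slot is free.

```python
import asyncio
import functools


async def run_limited(coro_factories, limit=3):
    """Run zero-argument coroutine factories, at most `limit` at a time."""
    pending = set()
    results = []

    def collect(done):
        for task in done:
            results.append(task.result())  # re-raises the task's exception, if any

    for factory in coro_factories:
        if len(pending) >= limit:
            # Wait for some task to finish before adding a new one
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED)
            collect(done)
        pending.add(asyncio.create_task(factory()))
    if pending:  # asyncio.wait() rejects an empty set
        done, _ = await asyncio.wait(pending)
        collect(done)
    return results  # roughly completion order, not submission order


async def fetch(i):
    await asyncio.sleep(0.01)  # stand-in for a real download
    return i * 2


if __name__ == "__main__":
    jobs = (functools.partial(fetch, i) for i in range(9))
    print(sorted(asyncio.run(run_limited(jobs))))
```

In this sketch a failing task propagates its exception out of run_limited() and leaves the other tasks running; a fuller version would likely cancel them first.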

I used Mikhail Gerasimov's answer and ended up with this little gem

async def gather_with_concurrency(n, *coros):
    semaphore = asyncio.Semaphore(n)

    async def sem_coro(coro):
        async with semaphore:
            return await coro

    return await asyncio.gather(*(sem_coro(c) for c in coros))

Which you would run instead of normal gather

await gather_with_concurrency(100, *my_coroutines)
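
As a small self-contained check of how gather_with_concurrency behaves, the sketch below repeats the helper and runs it on a hypothetical square() coroutine whose later inputs finish sooner. The results still come back in input order, because asyncio.gather preserves the order of its arguments.

```python
import asyncio


async def gather_with_concurrency(n, *coros):
    semaphore = asyncio.Semaphore(n)

    async def sem_coro(coro):
        async with semaphore:
            return await coro

    return await asyncio.gather(*(sem_coro(c) for c in coros))


async def square(x):
    await asyncio.sleep(0.01 * (5 - x))  # later inputs finish sooner
    return x * x


async def demo():
    # At most 2 of the 5 coroutines run at the same time.
    return await gather_with_concurrency(2, *(square(x) for x in range(5)))


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Note that all coroutine objects are created up front here; only their execution is throttled, so this fits workloads whose full list of coroutines is small enough to hold in memory.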

Using semaphore, you can also create a decorator to wrap the function

import asyncio
from functools import wraps


def request_concurrency_limit_decorator(limit=3):
    # Bind the default event loop
    sem = asyncio.Semaphore(limit)

    def executor(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            async with sem:
                return await func(*args, **kwargs)

        return wrapper

    return executor

Then, add the decorator to the original download function.

@request_concurrency_limit_decorator(limit=...)
async def download(...):
    ...

Now you can call the download function like before, but with Semaphore to limit the concurrency.

await download(...)

It should be noted that on Python versions before 3.10 the Semaphore is bound to the default event loop at the moment the decorator function is executed, so you cannot then call asyncio.run, which creates a new loop. Instead, call asyncio.get_event_loop().run... to use the default event loop.

asyncio.Semaphore RuntimeError: Task got Future attached to a different loop
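
One way to sidestep the loop-binding problem might be to create the semaphore lazily, on the first call, when the loop is already running. The following is a sketch under that assumption, not the decorator from the answer: concurrency_limit is a hypothetical name and the stats dictionary exists only to observe the peak concurrency. It still assumes the decorated function is used from a single event loop (for example one asyncio.run call per process).

```python
import asyncio
from functools import wraps


def concurrency_limit(limit=3):
    def decorator(func):
        sem = None  # created on first call, inside the running loop

        @wraps(func)
        async def wrapper(*args, **kwargs):
            nonlocal sem
            if sem is None:
                sem = asyncio.Semaphore(limit)
            async with sem:
                return await func(*args, **kwargs)

        return wrapper

    return decorator


stats = {"active": 0, "peak": 0}


@concurrency_limit(limit=2)
async def download(code):
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    await asyncio.sleep(0.01)  # stand-in for real I/O
    stats["active"] -= 1
    return code


async def main():
    return await asyncio.gather(*(download(i) for i in range(6)))


if __name__ == "__main__":
    print(asyncio.run(main()), "peak concurrency:", stats["peak"])
```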

If you have a generator producing your tasks, there may be more tasks than you can fit in memory simultaneously.

The classic asyncio.Semaphore context-manager pattern races all tasks into memory simultaneously.

I don't like the asyncio.Queue pattern. You can prevent it from preloading all the tasks into memory (by setting maxsize=1), but it still requires boilerplate to define, start up and shut down the worker coroutines (which consume from the queue), and you have to ensure a worker won't fail if a task throws an exception. It feels unpythonic, like implementing your own multiprocessing.pool.

Instead, here is an alternative:

sem = asyncio.Semaphore(n := 5)  # specify maximum concurrency


async def task_wrapper(args):
    try:
        await my_task(*args)
    finally:
        sem.release()


for args in my_generator:  # may yield too many to list
    await sem.acquire()
    asyncio.create_task(task_wrapper(args))


# wait for all tasks to complete
for i in range(n):
    await sem.acquire()

This pauses the generator when there are enough active tasks, and lets the event loop clean up finished tasks. Note: for older Python versions, replace create_task with ensure_future.
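
A self-contained version of this pattern might look like the sketch below. It is an illustration under assumptions rather than the snippet above verbatim: run_from_generator(), job() and demo() are hypothetical names, and the set of tasks is an addition that keeps a strong reference to each running task, as the asyncio documentation recommends for tasks created with create_task.

```python
import asyncio


async def run_from_generator(arg_iter, task_fn, limit=5):
    sem = asyncio.Semaphore(limit)
    tasks = set()  # strong references so running tasks are not garbage-collected

    async def wrapper(args):
        try:
            await task_fn(*args)
        finally:
            sem.release()  # free the slot even if the task raised

    for args in arg_iter:
        await sem.acquire()  # pauses this for-loop while all slots are busy
        task = asyncio.create_task(wrapper(args))
        tasks.add(task)
        task.add_done_callback(tasks.discard)

    # Re-acquiring every slot means no task still holds one.
    for _ in range(limit):
        await sem.acquire()


async def demo():
    seen = []
    stats = {"active": 0, "peak": 0}

    async def job(i):
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # stand-in for real work
        stats["active"] -= 1
        seen.append(i)

    await run_from_generator(((i,) for i in range(20)), job, limit=5)
    return sorted(seen), stats["peak"]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Exceptions raised by a task are not collected in this sketch; asyncio will only log them as never retrieved, so real code would probably record them inside wrapper().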