multiprocessing: Understanding logic behind `chunksize`

What factors determine an optimal chunksize argument to methods like multiprocessing.Pool.map()? The .map() method seems to use an arbitrary heuristic for its default chunksize (explained below); what motivates that choice, and is there a more thoughtful approach based on some particular situation/setup?

Example - say that I am:

  • Passing an iterable to .map() that has ~15 million elements;
  • Working on a machine with 24 cores and using the default processes = os.cpu_count() within multiprocessing.Pool() (minimal sketch below).
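
Here is a minimal sketch of that setup (my_func and my_iterable are placeholder names I'm using for illustration, not anything specific):

    import multiprocessing

    def my_func(x):
        # placeholder for the real per-item work
        return x

    if __name__ == "__main__":
        my_iterable = range(15_000_000)  # ~15 million elements
        # processes defaults to os.cpu_count(), i.e. 24 on this machine
        with multiprocessing.Pool() as pool:
            # chunksize=None triggers the default heuristic discussed below
            results = pool.map(my_func, my_iterable)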

My naive thinking is to give each of 24 workers an equally-sized chunk, i.e. 15_000_000 / 24 or 625,000. Large chunks should reduce turnover/overhead while fully utilizing all workers. But it seems that this is missing some potential downsides of giving large batches to each worker. Is this an incomplete picture, and what am I missing?


Part of my question stems from the default logic for if chunksize=None: both .map() and .starmap() call .map_async(), which looks like this:

    def _map_async(self, func, iterable, mapper, chunksize=None, callback=None,
            error_callback=None):
        # ... (materialize `iterable` to list if it's an iterator)
        if chunksize is None:
            chunksize, extra = divmod(len(iterable), len(self._pool) * 4)  # ????
            if extra:
                chunksize += 1
        if len(iterable) == 0:
            chunksize = 0

This makes the resulting chunksize a factor of 4 smaller than my "naive logic" from above, which consists of just dividing the length of the iterable by the number of workers in pool._pool.

What's the logic behind divmod(len(iterable), len(self._pool) * 4)? This implies that the chunksize will be closer to 15_000_000 / (24 * 4) == 156_250. What's the intention in multiplying len(self._pool) by 4?
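
To make the arithmetic concrete, here is a sketch of the two computations side by side (default_chunksize is just a name I made up for the heuristic above):

    def default_chunksize(n_items, n_workers):
        # mirrors the stdlib logic above for chunksize=None
        chunksize, extra = divmod(n_items, n_workers * 4)
        if extra:
            chunksize += 1
        return chunksize

    print(default_chunksize(15_000_000, 24))  # 156250 -- the default heuristic
    print(15_000_000 // 24)                   # 625000 -- the naive one-chunk-per-worker split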

Lastly, there is also this snippet from the Python docs on .imap() that further drives my curiosity:

The chunksize argument is the same as the one used by the map() method. For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1.
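
Applied to .imap(), that advice might look like this sketch (square and the chunksize of 10_000 are made-up examples, not values the docs recommend):

    from multiprocessing import Pool

    def square(x):
        return x * x

    if __name__ == "__main__":
        with Pool() as pool:
            # imap defaults to chunksize=1; a larger value amortizes the
            # per-item IPC overhead across each chunk of results
            for result in pool.imap(square, range(1_000_000), chunksize=10_000):
                pass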


Related answer that is helpful but a bit too high-level: Python multiprocessing: why are large chunksizes slower?.

I think that part of what you're missing is that your naive estimate assumes that each unit of work takes the same amount of time, in which case your strategy would be the best. But if some jobs finish sooner than others, then some cores may become idle waiting for the slow jobs to finish.

Thus, by breaking the work up into 4 times more chunks, a core that finishes its chunk early can start on the next chunk while the other cores keep working on their slower chunks.

I don't know why they picked the factor 4 exactly, but it would be a trade-off between minimising the overhead of the map code (which wants the largest chunks possible) and balancing chunks that take different amounts of time (which wants the smallest chunks possible).
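
Here's a toy demonstration of that trade-off; the sleep durations are invented purely to skew task lengths, so exact timings will vary by machine:

    import time
    from multiprocessing import Pool

    def work(n):
        # invented skew: the last quarter of the items is ~100x slower
        time.sleep(0.01 if n >= 1_500 else 0.0001)

    if __name__ == "__main__":
        items = range(2_000)
        with Pool(4) as pool:
            for cs in (500, 125, 1):  # 1 chunk/worker, 4 chunks/worker, max granularity
                start = time.perf_counter()
                pool.map(work, items, chunksize=cs)
                print(f"chunksize={cs:>3}: {time.perf_counter() - start:.2f}s")

With one chunk per worker, whichever worker draws the slow tail becomes the bottleneck; with four chunks per worker, the slow chunks get spread across cores; with chunksize=1 the load balance is perfect but the per-task overhead is at its largest.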

Please find Part II of this answer here.

Benchmark environment: Python 3.7.1, Ubuntu 18.04.2, Intel® Core™ i7-2600K CPU @ 3.40GHz × 8

The input values which will be varied are the number of iterations in the for-loop (30k, 30M, 600M) and the additionally sent data size (per taskel, numpy ndarray: 0 MiB, 50 MiB).
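
A taskel of the kind described might look roughly like the following sketch (busy_foo is a hypothetical stand-in, not the actual benchmark code):

    import numpy as np

    def busy_foo(data, iterations):
        # hypothetical taskel: the for-loop models the CPU-bound part,
        # while `data` (a numpy ndarray of 0 or 50 MiB) models the
        # per-taskel payload that has to be pickled and sent to the worker
        for _ in range(iterations):
            pass
        return data.nbytes

    # e.g. a 50 MiB float64 payload: 50 * 2**20 bytes / 8 bytes per element
    payload = np.zeros(50 * 2**20 // 8)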