What factors determine an optimal `chunksize` argument to methods like `multiprocessing.Pool.map()`? The `.map()` method seems to use an arbitrary heuristic for its default `chunksize` (explained below); what motivates that choice, and is there a more thoughtful approach based on some particular situation/setup?
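For concreteness, the kind of call I have in mind looks like the sketch below; the worker function and data are made-up placeholders for illustration:

```python
import multiprocessing
import os

def transform(x):
    # Stand-in for some CPU-bound work; the real function doesn't matter here.
    return x * x

if __name__ == "__main__":
    data = range(15_000_000)  # ~15 million elements
    with multiprocessing.Pool(processes=os.cpu_count()) as pool:
        # `chunksize` is the knob in question; None triggers the default heuristic.
        results = pool.map(transform, data, chunksize=None)
```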
Example - say that I am:
- Passing an `iterable` to `.map()` that has ~15 million elements;
- Working on a machine with 24 cores and using the default `processes = os.cpu_count()` within `multiprocessing.Pool()`.
My naive thinking is to give each of the 24 workers an equally sized chunk, i.e. `15_000_000 / 24`, or 625,000. Large chunks should reduce turnover/overhead while fully utilizing all workers. But this seems to miss some potential downsides of handing large batches to each worker. Is this an incomplete picture, and what am I missing?
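In code, the naive split I'm imagining is just this (illustrative numbers only):

```python
n_items = 15_000_000
n_workers = 24

naive_chunksize = n_items // n_workers
print(naive_chunksize)  # 625000 -> exactly one chunk per worker
```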
Part of my question stems from the default logic used when `chunksize=None`: both `.map()` and `.starmap()` call `.map_async()`, which looks like this:
```python
def _map_async(self, func, iterable, mapper, chunksize=None, callback=None,
               error_callback=None):
    # ... (materialize `iterable` to a list if it's an iterator)
    if chunksize is None:
        chunksize, extra = divmod(len(iterable), len(self._pool) * 4)  # ????
        if extra:
            chunksize += 1
    if len(iterable) == 0:
        chunksize = 0
```
What's the logic behind `divmod(len(iterable), len(self._pool) * 4)`? This implies that the chunksize will be closer to `15_000_000 / (24 * 4) == 156_250`. What's the intention in multiplying `len(self._pool)` by 4?
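Replaying that heuristic with my numbers (a standalone re-implementation of the snippet above, not the stdlib code itself) confirms the arithmetic:

```python
def default_chunksize(n_items, n_workers):
    # Mirrors the stdlib heuristic quoted above.
    chunksize, extra = divmod(n_items, n_workers * 4)
    if extra:
        chunksize += 1
    return chunksize

print(default_chunksize(15_000_000, 24))  # 156250 -> 96 chunks, 4 per worker
```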
This makes the resulting chunksize a factor of 4 smaller than my "naive logic" from above, which consists of just dividing the length of the iterable by the number of workers in `pool._pool`.
Lastly, there is also this snippet from the Python docs on `.imap()` that further drives my curiosity:
> The `chunksize` argument is the same as the one used by the `map()` method. For very long iterables using a large value for `chunksize` can make the job complete much faster than using the default value of 1.
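To make the docs' claim concrete, this is the shape of usage I understand them to mean (toy function and sizes are mine; `.imap()` defaults to `chunksize=1`):

```python
import multiprocessing

def work(x):
    return x * x

if __name__ == "__main__":
    with multiprocessing.Pool() as pool:
        # Default chunksize=1: one task (and IPC round-trip) per element.
        slow = list(pool.imap(work, range(1_000_000)))
        # Larger chunksize: far fewer round-trips over a long iterable.
        fast = list(pool.imap(work, range(1_000_000), chunksize=10_000))
```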
Related answer that is helpful but a bit too high-level: Python multiprocessing: why are large chunksizes slower?.