If you open the .vscode/launch.json file in your workspace or select Debug > Open Configurations then you should see a set of launch configurations for debugging your code. You can then add to it an env property with a dictionary of string:string.
Here is an example for an ASP.NET Core app from their standard web template, setting ASPNETCORE_ENVIRONMENT to Development:
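The original snippet is not reproduced here; a minimal sketch of the relevant env block, added inside the generated configuration object in launch.json (the rest of the configuration is omitted), could look like this:

"env": {
    "ASPNETCORE_ENVIRONMENT": "Development"
}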
While running the debugger, go to your variables tab (right click to reopen if not visible) and then open "global" and then "process." There should then be an env section...
Could they make it any harder? Here's what I did: open System Properties, click on Advanced, add the environment variable, shut down Visual Studio and start it up again.
Pool's chunksize-algorithm is a heuristic. It provides a simple solution for all imaginable problem scenarios you are trying to stuff into Pool's methods. As a consequence, it cannot be optimized for any specific scenario.
The algorithm arbitrarily divides the iterable into approximately four times more chunks than the naive approach. More chunks mean more overhead, but increased scheduling flexibility. As this answer will show, this leads to higher worker-utilization on average, but without the guarantee of a shorter overall computation time for every case.
"That's nice to know" you might think, "but how does knowing this help me with my concrete multiprocessing problems?" Well, it doesn't. The more honest short answer is, "there is no short answer", "multiprocessing is complex" and "it depends". An observed symptom can have different roots, even for similar scenarios.
This answer tries to provide you with basic concepts helping you to get a clearer picture of Pool's scheduling black box. It also tries to give you some basic tools at hand for recognizing and avoiding potential cliffs as far as they are related to chunksize.
Table of Contents

Part I

1. Definitions
2. Parallelization Goals
3. Parallelization Scenarios
4. Risks of Chunksize > 1
5. Pool's Chunksize-Algorithm
6. Quantifying Algorithm Efficiency
6.1 Models
6.2 Parallel Schedule
6.3 Efficiencies
6.3.1 Absolute Distribution Efficiency (ADE)
6.3.2 Relative Distribution Efficiency (RDE)

It is necessary to clarify some important terms first.
1. Definitions

Chunk

A chunk here is a share of the iterable-argument specified in a pool-method call. How the chunksize gets calculated and what effects this can have, is the topic of this answer.

Task

A task's physical representation in a worker-process in terms of data can be seen in the figure below.

The figure shows an example call to pool.map(), displayed along a line of code, taken from the multiprocessing.pool.worker function, where a task read from the inqueue gets unpacked. worker is the underlying main-function in the MainThread of a pool-worker-process. The func-argument specified in the pool-method will only match the func-variable inside the worker-function for single-call methods like apply_async and for imap with chunksize=1. For the rest of the pool-methods with a chunksize-parameter, the processing-function func will be a mapper-function (mapstar or starmapstar). This function maps the user-specified func-parameter on every element of the transmitted chunk of the iterable (--> "map-tasks"). The time this takes defines a task also as a unit of work.

Taskel

While the usage of the word "task" for the whole processing of one chunk is matched by code within multiprocessing.pool, there is no indication how a single call to the user-specified func, with one element of the chunk as argument(s), should be referred to. To avoid confusion emerging from naming conflicts (think of the maxtasksperchild-parameter for Pool's __init__-method), this answer will refer to the single units of work within a task as taskel.

A taskel (from task + element) is the smallest unit of work within a task. It is the single execution of the function specified with the func-parameter of a Pool-method, called with arguments obtained from a single element of the transmitted chunk. A task consists of chunksize taskels.

Parallelization Overhead (PO)

PO consists of Python-internal overhead and overhead for inter-process communication (IPC). The per-task overhead within Python comes with the code needed for packaging and unpacking the tasks and their results. IPC-overhead comes with the necessary synchronization of threads and the copying of data between different address spaces (two copy steps needed: parent -> queue -> child). The amount of IPC-overhead is OS-, hardware- and data-size dependent, which makes generalizations about its impact difficult.
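To make these terms concrete, here is a tiny illustration (my own example, not code from the original answer): with chunksize=3, an iterable of eight elements is split into three chunks; each chunk becomes one task of at most three taskels.

from multiprocessing import Pool

def f(x):
    # the user-specified `func`; every single call of f is one taskel
    return x * x

if __name__ == '__main__':
    with Pool(2) as pool:
        # 8 elements, chunksize=3 -> chunks/tasks: (0, 1, 2), (3, 4, 5), (6, 7)
        print(pool.map(f, range(8), chunksize=3))  # [0, 1, 4, 9, 16, 25, 36, 49]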
2. Parallelization Goals
When using multiprocessing, our overall goal (obviously) is to minimize total processing time for all tasks. To reach this overall goal, our technical goal needs to be optimizing the utilization of hardware resources.
Some important sub-goals for achieving the technical goal are:
minimize parallelization overhead (most famously, but not alone: IPC)
high utilization across all cpu-cores
keeping memory usage limited to prevent the OS from excessive paging (thrashing)
First of all, the tasks need to be computationally heavy (intensive) enough to earn back the PO we have to pay for parallelization. The relevance of PO decreases with increasing absolute computation time per taskel. Or, to put it the other way around, the bigger the absolute computation time per taskel for your problem, the less relevant the need for reducing PO becomes. If your computation will take hours per taskel, the IPC overhead will be negligible in comparison. The primary concern here is to prevent idling worker processes after all tasks have been distributed. Keeping all cores loaded means we are parallelizing as much as possible.
3. Parallelization Scenarios
What factors determine an optimal chunksize argument to methods like multiprocessing.Pool.map()
The major factor in question is how much computation time may vary across our single taskels. To name it, the choice for an optimal chunksize is determined by the Coefficient of Variation (CV) for computation times per taskel.
The two extreme scenarios on a scale, following from the extent of this variation are:
All taskels need exactly the same computation time.
A taskel could take seconds or days to finish.
For better memorability, I will refer to these scenarios as:
Dense Scenario
Wide Scenario
Dense Scenario
In a Dense Scenario it would be desirable to distribute all taskels at once, to keep necessary IPC and context switching at a minimum. This means we want to create only as many chunks as there are worker processes. As already stated above, the weight of PO increases with shorter computation times per taskel.
For maximal throughput, we also want all worker processes busy until all tasks are processed (no idling workers). For this goal, the distributed chunks should be of equal size or close to.
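As a hedged sketch of what that can look like in user code (my own example, not from the original answer), an equal-sized split into exactly as many chunks as there are workers can be forced by passing chunksize explicitly:

import math
from multiprocessing import Pool

def crunch(x):
    return x * x  # stand-in for an equally expensive computation per taskel

if __name__ == '__main__':
    data = list(range(1000))
    n_workers = 4
    with Pool(n_workers) as pool:
        # one (roughly) equal-sized chunk per worker
        cs = math.ceil(len(data) / n_workers)
        results = pool.map(crunch, data, chunksize=cs)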
Wide Scenario
The prime example for a Wide Scenario would be an optimization problem, where results either converge quickly or computation can take hours, if not days. Usually it is not predictable what mixture of "light taskels" and "heavy taskels" a task will contain in such a case, hence it's not advisable to distribute too many taskels in a task-batch at once. Distributing fewer taskels at once than possible means increasing scheduling flexibility. This is needed here to reach our sub-goal of high utilization of all cores.
If Pool's methods were, by default, totally optimized for the Dense Scenario, they would increasingly create suboptimal timings for every problem located closer to the Wide Scenario.
4. Risks of Chunksize > 1
Consider this simplified pseudo-code example of a Wide Scenario-iterable, which we want to pass into a pool-method:
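The original literal is not reproduced in this copy; a sketch, with the values flattened from the chunks listed below (the variable name is illustrative):

good_luck_iterable = [60, 60, 86400, 60, 86400, 60, 60, 84600]
# every value stands for the needed computation time of one taskel in seconds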
Instead of the actual values, we pretend to see the needed computation time in seconds, for simplicity only 1 minute or 1 day.
We assume the pool has four worker processes (on four cores) and chunksize is set to 2. Because the order will be kept, the chunks sent to the workers will be these:
[(60, 60), (86400, 60), (86400, 60), (60, 84600)]
Since we have enough workers and the computation time is high enough, we can say that every worker process will get a chunk to work on in the first place. (This does not have to be the case for fast-completing tasks.) Further we can say that the whole processing will take about 86400+60 seconds, because that's the highest total computation time for a chunk in this artificial scenario and we distribute chunks only once.
Now consider this iterable, which has only one element switching its position compared to the previous iterable:
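Again as a sketch, flattened from the corresponding chunks below:

bad_luck_iterable = [60, 60, 86400, 86400, 60, 60, 60, 84600]
# the two day-long taskels now sit next to each other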
...and the corresponding chunks:

[(60, 60), (86400, 86400), (60, 60), (60, 84600)]

Just bad luck with the sorting of our iterable nearly doubled (86400+86400) our total processing time! The worker getting the vicious (86400, 86400)-chunk is blocking the second heavy taskel in its task from getting distributed to one of the idling workers already finished with their (60, 60)-chunks. We obviously would not risk such an unpleasant outcome if we set chunksize=1.
This is the risk of bigger chunksizes. With higher chunksizes we trade scheduling flexibility for less overhead and in cases like above, that's a bad deal.
As we will see in chapter 6. Quantifying Algorithm Efficiency, bigger chunksizes can also lead to suboptimal results for Dense Scenarios.
5. Pool's Chunksize-Algorithm
Below you will find a slightly modified version of the algorithm inside the source code. As you can see, I cut off the lower part and wrapped it into a function for calculating the chunksize argument externally. I also replaced 4 with a factor parameter and outsourced the len() calls.
# mp_utils.py

def calc_chunksize(n_workers, len_iterable, factor=4):
    """Calculate chunksize argument for Pool-methods.

    Resembles source-code within `multiprocessing.pool.Pool._map_async`.
    """
    chunksize, extra = divmod(len_iterable, n_workers * factor)
    if extra:
        chunksize += 1
    return chunksize
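A few quick calls (my own spot checks, easy to verify with divmod by hand):

from mp_utils import calc_chunksize

calc_chunksize(4, 64)   # -> 4,  divmod(64, 16) == (4, 0), no extra
calc_chunksize(4, 65)   # -> 5,  divmod(65, 16) == (4, 1), extra -> +1
calc_chunksize(4, 500)  # -> 32, divmod(500, 16) == (31, 4), extra -> +1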
To ensure we are all on the same page, here's what divmod does:
divmod(x, y) is a builtin function which returns (x//y, x%y).
x // y is the floor division, returning the down rounded quotient from x / y, while
x % y is the modulo operation returning the remainder from x / y.
Hence e.g. divmod(10, 3) returns (3, 1).
Now when you look at chunksize, extra = divmod(len_iterable, n_workers * 4), you will notice n_workers here is the divisor y in x / y and multiplication by 4, without further adjustment through if extra: chunksize +=1 later on, leads to an initial chunksize at least four times smaller (for len_iterable >= n_workers * 4) than it would be otherwise.
For viewing the effect of multiplication by 4 on the intermediate chunksize result consider this function:
def compare_chunksizes(len_iterable, n_workers=4):
    """Calculate naive chunksize, Pool's stage-1 chunksize and the chunksize
    for Pool's complete algorithm. Return chunksizes and the real factors by
    which naive chunksizes are bigger.
    """
    cs_naive = len_iterable // n_workers or 1  # naive approach
    cs_pool1 = len_iterable // (n_workers * 4) or 1  # incomplete pool algo.
    cs_pool2 = calc_chunksize(n_workers, len_iterable)
    real_factor_pool1 = cs_naive / cs_pool1
    real_factor_pool2 = cs_naive / cs_pool2
    return cs_naive, cs_pool1, cs_pool2, real_factor_pool1, real_factor_pool2
The function above calculates the naive chunksize (cs_naive) and the first-step chunksize of Pool's chunksize-algorithm (cs_pool1), as well as the chunksize for the complete Pool-algorithm (cs_pool2). Further it calculates the real factors rf_pool1 = cs_naive / cs_pool1 and rf_pool2 = cs_naive / cs_pool2, which tell us how many times the naively calculated chunksizes are bigger than Pool's internal version(s).
Below you see two figures created with output from this function. The left figure just shows the chunksizes for n_workers=4 up until an iterable length of 500. The right figure shows the values for rf_pool1. For iterable length 16, the real factor becomes >= 4 (for len_iterable >= n_workers * 4) and its maximum value is 7 for iterable lengths 28-31. That's a massive deviation from the original factor 4 the algorithm converges to for longer iterables. 'Longer' here is relative and depends on the number of specified workers.
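A hand-computed spot check (my own numbers, not read off the figures), assuming the two functions above are defined:

compare_chunksizes(28)   # -> (7, 1, 2, 7.0, 3.5)          rf_pool1 hits its maximum of 7
compare_chunksizes(500)  # -> (125, 31, 32, ~4.03, ~3.91)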
Remember chunksize cs_pool1 still lacks the extra-adjustment with the remainder from divmod contained in cs_pool2 from the complete algorithm.
The algorithm goes on with:
if extra:
    chunksize += 1
Now in cases where there is a remainder (an extra from the divmod-operation), increasing the chunksize by 1 obviously cannot work out for every task. After all, if it would, there would not be a remainder to begin with.
As you can see in the figures below, the "extra-treatment" has the effect that the real factor for rf_pool2 now converges towards 4 from below and the deviation is somewhat smoother. Standard deviation for n_workers=4 and len_iterable=500 drops from 0.5233 for rf_pool1 to 0.4115 for rf_pool2.
Eventually, increasing chunksize by 1 has the effect that the last transmitted task only has a size of len_iterable % chunksize or chunksize.

The more interesting and, as we will see later, more consequential effect of the extra-treatment, however, can be observed for the number of generated chunks (n_chunks).
For long enough iterables, Pool's completed chunksize-algorithm (n_pool2 in the figure below) will stabilize the number of chunks at n_chunks == n_workers * 4.
In contrast, the naive algorithm (after an initial burp) keeps alternating between n_chunks == n_workers and n_chunks == n_workers + 1 as the length of the iterable grows.
Below you will find two enhanced info-functions for Pool's and the naive chunksize-algorithm. The output of these functions will be needed in the next chapter.
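The two info-functions themselves are not contained in this copy of the answer. A minimal sketch, reconstructed only from the fields that calc_ade below consumes (n_workers, len_iterable, n_chunks, chunksize, last_chunk), could look like this:

from collections import namedtuple

Chunkinfo = namedtuple(
    'Chunkinfo', ['n_workers', 'len_iterable', 'n_chunks', 'chunksize', 'last_chunk']
)

def calc_chunksize_info(n_workers, len_iterable, factor=4):
    """Calculate chunksize numbers for Pool's algorithm."""
    chunksize, extra = divmod(len_iterable, n_workers * factor)
    if extra:
        chunksize += 1
    # number of chunks the iterable is actually split into
    n_chunks = len_iterable // chunksize + (len_iterable % chunksize > 0)
    # size of the last, possibly smaller chunk
    last_chunk = len_iterable % chunksize or chunksize
    return Chunkinfo(n_workers, len_iterable, n_chunks, chunksize, last_chunk)

def calc_naive_chunksize_info(n_workers, len_iterable):
    """Calculate chunksize numbers for the naive algorithm."""
    chunksize = len_iterable // n_workers or 1
    n_chunks = len_iterable // chunksize + (len_iterable % chunksize > 0)
    last_chunk = len_iterable % chunksize or chunksize
    return Chunkinfo(n_workers, len_iterable, n_chunks, chunksize, last_chunk)

# e.g. calc_chunksize_info(4, 500)
# -> Chunkinfo(n_workers=4, len_iterable=500, n_chunks=16, chunksize=32, last_chunk=20)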
6. Quantifying Algorithm Efficiency
Now, after we have seen how the output of Pool's chunksize-algorithm looks different compared to output from the naive algorithm...
How to tell if Pool's approach actually improves something?
And what exactly could this something be?
As shown in the previous chapter, for longer iterables (a bigger number of taskels), Pool's chunksize-algorithm approximately divides the iterable into four times more chunks than the naive method. Smaller chunks mean more tasks and more tasks mean more Parallelization Overhead (PO), a cost which must be weighed against the benefit of increased scheduling-flexibility (recall "Risks of Chunksize>1").
For rather obvious reasons, Pool's basic chunksize-algorithm cannot weigh scheduling-flexibility against PO for us. IPC-overhead is OS-, hardware- and data-size dependent. The algorithm cannot know on what hardware we run our code, nor does it have a clue how long a taskel will take to finish. It's a heuristic providing basic functionality for all possible scenarios. This means it cannot be optimized for any scenario in particular. As mentioned before, PO also becomes increasingly less of a concern with increasing computation times per taskel (negative correlation).
When you recall the Parallelization Goals from chapter 2, one bullet-point was:
high utilization across all cpu-cores
The previously mentioned something which Pool's chunksize-algorithm can try to improve is the minimization of idling worker-processes, respectively the utilization of cpu-cores.
A repeating question on SO regarding multiprocessing.Pool is asked by people wondering about unused cores / idling worker-processes in situations where you would expect all worker-processes busy. While this can have many reasons, idling worker-processes towards the end of a computation are an observation we can often make, even with Dense Scenarios (equal computation times per taskel) in cases where the number of workers is not a divisor of the number of chunks (n_chunks % n_workers > 0).
The question now is:
How can we practically translate our understanding of chunksizes into something which enables us to explain observed worker-utilization, or even compare the efficiency of different algorithms in that regard?
6.1 Models
For gaining deeper insights here, we need a form of abstraction of parallel computations which simplifies the overly complex reality down to a manageable degree of complexity, while preserving significance within defined boundaries. Such an abstraction is called a model. An implementation of such a "Parallelization Model" (PM) generates worker-mapped meta-data (timestamps) as real computations would, if the data were to be collected. The model-generated meta-data allows predicting metrics of parallel computations under certain constraints.
One of two sub-models within the here defined PM is the Distribution Model (DM). The DM explains how atomic units of work (taskels) are distributed over parallel workers and time, when no other factors than the respective chunksize-algorithm, the number of workers, the input-iterable (number of taskels) and their computation duration are considered. This means any form of overhead is not included.
For obtaining a complete PM, the DM is extended with an Overhead Model (OM), representing various forms of Parallelization Overhead (PO). Such a model needs to be calibrated for each node individually (hardware-, OS-dependencies). How many forms of overhead are represented in an OM is left open, so multiple OMs with varying degrees of complexity can exist. Which level of accuracy the implemented OM needs is determined by the overall weight of PO for the specific computation. Shorter taskels lead to a higher weight of PO, which in turn requires a more precise OM if we were attempting to predict Parallelization Efficiencies (PE).
6.2 Parallel Schedule (PS)
The Parallel Schedule is a two-dimensional representation of the parallel computation, where the x-axis represents time and the y-axis represents a pool of parallel workers. The number of workers and the total computation time mark the extent of a rectangle, in which smaller rectangles are drawn. These smaller rectangles represent atomic units of work (taskels).
Below you find the visualization of a PS drawn with data from the DM of Pool's chunksize-algorithm for the Dense Scenario.
The x-axis is sectioned into equal units of time, where each unit stands for the computation time a taskel requires.
The y-axis is divided into the number of worker-processes the pool uses.
A taskel here is displayed as the smallest cyan-colored rectangle, put into a timeline (a schedule) of an anonymized worker-process.
A task is one or multiple taskels in a worker-timeline continuously highlighted with the same hue.
Idling time units are represented through red colored tiles.
The Parallel Schedule is partitioned into sections. The last section is the tail-section.
The names for the composed parts can be seen in the picture below.
In a complete PM including an OM, the Idling Share is not limited to the tail, but also comprises space between tasks and even between taskels.
6.3 Efficiencies
The Models introduced above allow quantifying the rate of worker-utilization. We can distinguish:
Distribution Efficiency (DE) - calculated with help of a DM (or a simplified method for the Dense Scenario).
Parallelization Efficiency (PE) - either calculated with help of a calibrated PM (prediction) or calculated from meta-data of real computations.
It's important to note that calculated efficiencies do not automatically correlate with faster overall computation for a given parallelization problem. Worker-utilization in this context only distinguishes between a worker having a started, yet unfinished taskel and a worker not having such an "open" taskel. That means, possible idling during the time span of a taskel is not registered.
All above mentioned efficiencies are basically obtained by calculating the quotient of the division Busy Share / Parallel Schedule. The difference between DE and PE comes with the Busy Share occupying a smaller portion of the overall Parallel Schedule for the overhead-extended PM.
This answer will further only discuss a simple method to calculate DE for the Dense Scenario. This is sufficiently adequate to compare different chunksize-algorithms, since...
... the DM is the part of the PM, which changes with different chunksize-algorithms employed.
... the Dense Scenario with equal computation durations per taskel depicts a "stable state", for which these time spans drop out of the equation. Any other scenario would just lead to random results since the ordering of taskels would matter.
6.3.1 Absolute Distribution Efficiency (ADE)
This basic efficiency can be calculated in general by dividing the Busy Share through the whole potential of the Parallel Schedule:
Absolute Distribution Efficiency (ADE) = Busy Share / Parallel Schedule
For the Dense Scenario, the simplified calculation-code looks like this:
# mp_utils.py

def calc_ade(n_workers, len_iterable, n_chunks, chunksize, last_chunk):
    """Calculate Absolute Distribution Efficiency (ADE).

    `len_iterable` is not used, but contained to keep a consistent signature
    with `calc_rde`.
    """
    if n_workers == 1:
        return 1

    potential = (
        ((n_chunks // n_workers + (n_chunks % n_workers > 1)) * chunksize)
        + (n_chunks % n_workers == 1) * last_chunk
    ) * n_workers

    n_full_chunks = n_chunks - (chunksize > last_chunk)
    taskels_in_regular_chunks = n_full_chunks * chunksize
    real = taskels_in_regular_chunks + (chunksize > last_chunk) * last_chunk
    ade = real / potential

    return ade
If there is no Idling Share, Busy Share will be equal to Parallel Schedule, hence we get an ADE of 100%. In our simplified model, this is a scenario where all available processes will be busy through the whole time needed for processing all tasks. In other words, the whole job gets effectively parallelized to 100 percent.
But why do I keep referring to DE as absolute DE here?
To comprehend that, we have to consider a possible case for the chunksize (cs) which ensures maximal scheduling flexibility (also, the number of Highlanders there can be. Coincidence?):
__________________________________~ ONE ~__________________________________
If we, for example, have four worker-processes and 37 taskels, there will be idling workers even with chunksize=1, just because n_workers=4 is not a divisor of 37. The remainder of dividing 37 / 4 is 1. This single remaining taskel will have to be processed by a sole worker, while the remaining three are idling.
Likewise, there will still be one idling worker with 39 taskels, as you can see pictured below.
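Plugging the 39-taskel case into calc_ade from above puts this idling into numbers (my own spot check, not values quoted from the original figures):

# 39 taskels, 4 workers
calc_ade(4, 39, n_chunks=39, chunksize=1, last_chunk=1)  # -> 0.975   (39 / 40)
# Pool's algorithm chooses chunksize=3 here -> 13 chunks, last_chunk=3
calc_ade(4, 39, n_chunks=13, chunksize=3, last_chunk=3)  # -> 0.8125  (39 / 48)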
When you compare the upper Parallel Schedule for chunksize=1 with the below version for chunksize=3, you will notice that the upper Parallel Schedule is smaller, the timeline on the x-axis shorter. It should become obvious now, how bigger chunksizes unexpectedly also can lead to increased overall computation times, even for Dense Scenarios.
But why not just use the length of the x-axis for efficiency calculations?
Because the overhead is not contained in this model. It will be different for both chunksizes, hence the x-axis is not really directly comparable. The overhead can still lead to a longer total computation time like shown in case 2 from the figure below.
6.3.2 Relative Distribution Efficiency (RDE)
The ADE value does not contain the information if a better distribution of taskels is possible with chunksize set to 1. Better here still means a smaller Idling Share.
To get a DE value adjusted for the maximum possible DE, we have to divide the considered ADE by the ADE we get for chunksize=1.
Relative Distribution Efficiency (RDE) = ADE_cs_x / ADE_cs_1
RDE, as defined here, in essence is a tale about the tail of a Parallel Schedule. RDE is influenced by the maximum effective chunksize contained in the tail. (This tail can be of x-axis length chunksize or last_chunk.)

This has the consequence that RDE naturally converges to 100% even for all sorts of "tail-looks", like shown in the figure below.
A low RDE ...
is a strong hint for optimization potential.
naturally gets less likely for longer iterables, because the relative tail-portion of the overall Parallel Schedule shrinks.