What does "Stage Skipped" mean in the Apache Spark web UI?

In my Spark UI, some stages are marked as skipped. What does "skipped" mean?

Typically it means that the data was fetched from cache, so there was no need to re-execute the given stage. This is consistent with your DAG, which shows that the next stage requires shuffling (reduceByKey). Whenever shuffling is involved, Spark automatically caches the generated data:

Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are preserved until the corresponding RDDs are no longer used and are garbage collected. This is done so the shuffle files don’t need to be re-created if the lineage is re-computed.

Suppose you have an initial data frame with some data. You apply a couple of transformations to it and then run multiple actions on the final data frame. If you have cached a data frame, Spark materializes it when you call the first action and keeps it in memory in materialized form. When the next action is called, Spark walks through the whole DAG again, sees that the data frame was cached, and skips the stages that produced it by using the materialized copy already held in memory.

When Spark skips a stage, you will see it marked as skipped in the Spark UI. This speeds up your operation, because Spark does not have to compute the DAG from the root and can start its work from the cached data frame.