I would like to know how collectAsMap works in Spark. More specifically, I would like to know where the aggregation of the data from all partitions takes place: on the master or on the workers. In the first case, each worker sends its data to the master, and once the master has collected the data from every worker, it aggregates the results itself. In the second case, the workers are responsible for aggregating the results (after exchanging data among themselves), and only then are the results sent to the master.
It is critical for me to find a way for the master to collect the data from each partition separately, without the workers exchanging any data.
You can see how they are doing collectAsMap here. Since the RDD type is a tuple, it looks like they just use the normal RDD collect and then translate the tuples into a map of key/value pairs. They do mention in the comment that multi-map isn't supported, so you need a 1-to-1 key/value mapping across your data.
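Since collectAsMap is effectively collect() followed by a conversion into a map, its semantics can be sketched in plain Python (no Spark required; the tuples below are just stand-ins for an RDD's key/value pairs). Note how a duplicate key is silently overwritten, which is exactly the "multi-map isn't supported" caveat:

```python
# Stand-in for the (key, value) tuples a collect() on a pair RDD would return.
collected = [("a", 1), ("b", 2), ("a", 3)]

# collectAsMap ~ collect() then build a map. On a key clash the later pair
# wins, so only one value per key survives (no multi-map support).
as_map = dict(collected)

print(as_map)  # the duplicate key "a" keeps only its last value
```

If you need all values per key, collect the pairs and group them yourself instead of relying on collectAsMap.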
What collect does is execute a Spark job, fetch the results of each partition from the workers, and aggregate them with a reduce/concat phase on the driver.
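That driver-side concat phase can be sketched as a toy model (this is not Spark's actual code): each worker computes its partition's results independently, and the driver simply concatenates the per-partition arrays, with no worker-to-worker exchange anywhere in the process:

```python
# Toy model: each inner list plays the role of one partition's results,
# computed independently on a worker.
partition_results = [
    [("a", 1), ("b", 2)],   # partition 0
    [("c", 3)],             # partition 1
    [("d", 4), ("e", 5)],   # partition 2
]

# Driver-side aggregation: a pure concatenation of the per-partition arrays.
# No step here requires any worker to see another worker's data.
collected = [pair for part in partition_results for pair in part]

print(collected)
```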
So given that, it should be the case that the driver collects the data from each partition separately, without workers exchanging data, to perform collectAsMap.
Note: if you perform transformations on your RDD prior to calling collectAsMap that cause a shuffle to occur, there may be an intermediate step in which workers exchange data amongst themselves. Check your cluster master's application UI for more information about how Spark is executing your application.