Is there an "Explain RDD" in Spark?

Joseph Victor · May 11, 2015

In particular, if I say

rdd3 = rdd1.join(rdd2)

then when I call rdd3.collect, depending on the Partitioner used, either data is shuffled between partitions on different nodes, or the join is done locally on each partition (or, for all I know, something else entirely). This is the distinction the RDD paper draws between "narrow" and "wide" dependencies, but who knows how good the optimizer is in practice.
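For concreteness, here is the kind of contrast I mean (a sketch; the variable names are mine, and I'm assuming a plain HashPartitioner):

import org.apache.spark.HashPartitioner

val p = new HashPartitioner(8)

// Co-partitioned case: both inputs already share the same partitioner,
// so the join is a narrow dependency and can run locally per partition.
val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(p)
val rdd2 = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(p)
val rdd3 = rdd1.join(rdd2)

// Non-co-partitioned case: rdd4 has no partitioner, so its data must be
// shuffled to match rdd1's partitioning, i.e. a wide dependency on that side.
val rdd4 = sc.parallelize(Seq((1, "x"), (2, "y")))
val rdd5 = rdd1.join(rdd4)

At the call site, rdd3 and rdd5 look identical, which is exactly why I want something like explain.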

Anyway, I can kind of glean from the trace output what actually happened, but it would be nice to just call rdd3.explain.

Does such a thing exist?

Answer

Justin Pihony · May 11, 2015

I think toDebugString will appease your curiosity.

scala> val data = sc.parallelize(List((1,2)))
data: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[8] at parallelize at <console>:21

scala> val joinedData = data join data
joinedData: org.apache.spark.rdd.RDD[(Int, (Int, Int))] = MapPartitionsRDD[11] at join at <console>:23

scala> joinedData.toDebugString
res4: String =
(8) MapPartitionsRDD[11] at join at <console>:23 []
 |  MapPartitionsRDD[10] at join at <console>:23 []
 |  CoGroupedRDD[9] at join at <console>:23 []
 +-(8) ParallelCollectionRDD[8] at parallelize at <console>:21 []
 +-(8) ParallelCollectionRDD[8] at parallelize at <console>:21 []

Each indentation level marks a stage boundary: the lines prefixed with +- sit below a shuffle dependency, and the number in parentheses is the partition count. So this join should run as two stages.
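As a sanity check, you can watch a boundary disappear by pre-partitioning both sides yourself (a sketch, assuming a HashPartitioner; output elided):

import org.apache.spark.HashPartitioner

val p = new HashPartitioner(8)
val left  = sc.parallelize(List((1, 2))).partitionBy(p)
val right = sc.parallelize(List((1, 3))).partitionBy(p)

// Both sides share p, so the join itself adds no shuffle; the only
// +- boundaries left in the lineage come from partitionBy.
println(left.join(right).toDebugString)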

Also, the optimizer is fairly decent. However, I would suggest using DataFrames if you are on Spark 1.3+, as the Catalyst optimizer there is even better in many cases. :)
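As it happens, DataFrames also give you literally the explain you asked for. A minimal sketch (assuming the sqlContext provided by the 1.3+ shell):

import sqlContext.implicits._

val d1 = Seq((1, "a")).toDF("k", "v")
val d2 = Seq((1, "x")).toDF("k", "w")

// Prints the logical and physical plans, including the chosen join strategy.
d1.join(d2, d1("k") === d2("k")).explain(true)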