Number of partitions in RDD and performance in Spark

martin · Mar 4, 2016 · Viewed 34.2k times

In PySpark, I can create an RDD from a list and decide how many partitions it should have:

from pyspark import SparkContext

sc = SparkContext()
sc.parallelize(xrange(0, 10), 4)  # split into 4 partitions
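
As a quick sanity check (a minimal sketch using the standard RDD API, assuming a local SparkContext), you can see how many partition slots Spark assumes by default and confirm how many partitions the RDD actually got:

from pyspark import SparkContext

sc = SparkContext()

# Default partition count Spark uses when none is given (roughly the core count)
print(sc.defaultParallelism)

# Confirm how many partitions this RDD was split into
rdd = sc.parallelize(range(0, 10), 4)
print(rdd.getNumPartitions())  # 4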

How does the number of partitions I choose for my RDD influence performance? And how does this depend on the number of cores my machine has?

Answer

StephenBoesch · Mar 4, 2016

The primary effect comes from specifying either too few partitions or far too many partitions.

Too few partitions: You will not utilize all of the cores available in the cluster.

Too many partitions: There will be excessive overhead in managing many small tasks.

Between the two, the first is far more impactful on performance. At this point, scheduling too many small tasks has a relatively small impact for partition counts below 1000. If you have on the order of tens of thousands of partitions, then Spark gets very slow.
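
As a rough illustration of sizing the partition count relative to the available cores (a sketch; Spark's tuning guide commonly suggests around 2-4 partitions per CPU core, and the 3x multiplier below is just an example, not a recommendation from this answer):

from pyspark import SparkContext

sc = SparkContext()

# Start from a small multiple of the cores Spark reports
num_partitions = sc.defaultParallelism * 3
rdd = sc.parallelize(range(0, 1000000), num_partitions)

# If a transformation leaves far too many small partitions, coalesce() reduces the
# count without a full shuffle; repartition() would reshuffle everything evenly.
rdd = rdd.coalesce(num_partitions)

The multiplier is a trade-off: too low and cores sit idle, too high and the scheduler spends more time managing tasks than doing work.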