What is a glom?. How it is different from mapPartitions?

nagendra picture nagendra · Mar 2, 2016 · Viewed 11.5k times · Source

I've come across the glom() method on RDD. As per the documentation

Return an RDD created by coalescing all elements within each partition into an array

Does glom shuffle the data across the partitions or does it only return the partition data as an array? In the latter case, I believe that the same can be achieved using mapPartitions.

I would also like to know if there are any use cases that benefit from glom.

Answer

zero323 picture zero323 · Mar 2, 2016

Does glom shuffle the data across partitions

No, it doesn't

If this is the second case I believe that the same can be achieved using mapPartitions

It can:

rdd.mapPartitions(iter => Iterator(_.toArray))

but the same thing applies to any non shuffling transformation like map, flatMap or filter.

if there are any use cases which benefit from glob.

Any situation where you need to access partition data in a form that is traversable more than once.