I've come across the glom() method on RDD. As per the documentation:
"Return an RDD created by coalescing all elements within each partition into an array."
Does glom shuffle the data across the partitions, or does it only return the partition data as an array? In the latter case, I believe that the same can be achieved using mapPartitions.
I would also like to know if there are any use cases that benefit from glom.
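Does glom shuffle the data across partitions?
No, it doesn't. Each output partition is built only from the corresponding input partition, so no data moves between partitions.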
If this is the second case I believe that the same can be achieved using mapPartitions
It can:
rdd.mapPartitions(iter => Iterator(iter.toArray))
but the same thing applies to any non-shuffling transformation like map, flatMap or filter: each of them can be expressed with mapPartitions as well.
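To illustrate the equivalence, here is a minimal sketch that compares glom() with a hand-written mapPartitions version on a small RDD. It assumes a local Spark installation with spark-core on the classpath; the object name GlomEquivalence, the helper glomViaMapPartitions, and the local[2] master are choices made for this example, not anything from the Spark API itself.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object GlomEquivalence {
  // Hypothetical helper: a hand-written stand-in for glom().
  // The ClassTag is needed because toArray must know the element type.
  def glomViaMapPartitions[T: ClassTag](rdd: RDD[T]): RDD[Array[T]] =
    rdd.mapPartitions(iter => Iterator(iter.toArray))

  def main(args: Array[String]): Unit = {
    // Assumption: run locally with two worker threads.
    val conf = new SparkConf().setMaster("local[2]").setAppName("glom-equivalence")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 6, 3)

      // Convert arrays to lists so they compare by content, not by reference.
      val viaGlom = rdd.glom().collect().map(_.toList).toList
      val viaMapPartitions = glomViaMapPartitions(rdd).collect().map(_.toList).toList

      println(viaGlom)
      println(viaMapPartitions)

      // Both approaches keep the partition count and produce the same arrays.
      assert(viaGlom == viaMapPartitions)
      assert(rdd.glom().getNumPartitions == rdd.getNumPartitions)
    } finally {
      sc.stop()
    }
  }
}
```

Since the number of partitions is unchanged and each array holds exactly the elements of one input partition, neither version needs a shuffle.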
if there are any use cases which benefit from glom.
Any situation where you need to access partition data in a form that is traversable more than once.
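As one possible illustration of that point, the sketch below computes several statistics per partition. An Iterator can be consumed only once, whereas the array returned by glom() can be traversed as often as needed. The object name PartitionSummaries, the helper summarize, and the sample data are hypothetical; a local Spark setup is assumed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PartitionSummaries {
  // Hypothetical helper: (size, min, max) for every non-empty partition.
  // Each statistic is a separate pass over the same array.
  def summarize(rdd: RDD[Int]): Array[(Int, Int, Int)] =
    rdd.glom()
      .filter(_.nonEmpty) // min and max are undefined for empty partitions
      .map(arr => (arr.length, arr.min, arr.max))
      .collect()

  def main(args: Array[String]): Unit = {
    // Assumption: run locally with two worker threads.
    val conf = new SparkConf().setMaster("local[2]").setAppName("partition-summaries")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Seq(5, 1, 4, 9, 2, 7), 3)
      summarize(rdd).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The trade-off is memory: glom() materializes each partition as an array, so this pattern suits partitions that fit comfortably in executor memory.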