Apache Spark: What is the equivalent implementation of RDD.groupByKey() using RDD.aggregateByKey()?

jsears picture jsears · Jun 26, 2015 · Viewed 8.9k times · Source

The Apache Spark pyspark.RDD API docs mention that groupByKey() is inefficient. Instead, it is recommend to use reduceByKey(), aggregateByKey(), combineByKey(), or foldByKey() instead. This will result in doing some of the aggregation in the workers prior to the shuffle, thus reducing shuffling of data across workers.

Given the following data set and groupByKey() expression, what is an equivalent and efficient implementation (reduced cross-worker data shuffling) that does not utilize groupByKey(), but delivers the same result?

dataset = [("a", 7), ("b", 3), ("a", 8)]
rdd = (sc.parallelize(dataset)
       .groupByKey())
print sorted(rdd.mapValues(list).collect())

Output:

[('a', [7, 8]), ('b', [3])]

Answer

zero323 picture zero323 · Jun 26, 2015

As far as I can tell there is nothing to gain* in this particular case by using aggregateByKey or a similar function. Since you're building a list there is no "real" reduction and amount of data which has to be shuffled is more or less the same.

To really observe some performance gain you need transformations which actually reduces amount of the transfered data for example counting, computing summary statistics, finding unique elements.

Regarding differences benefits of using reduceByKey(), combineByKey(), or foldByKey() there is an important conceptual difference which is easier to see when you consider Scala API singatures.

Both reduceByKey and foldByKey map from RDD[(K, V)] to RDD[(K, V)] while the second one provides additional zero element.

reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)] 
foldByKey(zeroValue: V)(func: (V, V) ⇒ V): RDD[(K, V)]

combineByKey (there is no aggregateByKey, but it is the same type of transformation) transforms from RDD[(K, V)] to RDD[(K, C)]:

combineByKey[C](
   createCombiner: (V) ⇒ C,
   mergeValue: (C, V) ⇒ C,
   mergeCombiners: (C, C) ⇒ C): RDD[(K, C)] 

Going back to your example only combineByKey (and in PySpark aggregateByKey) is really applicable since you are transforming from RDD[(String, Int)] to RDD[(String, List[Int])].

While in a dynamic language like Python it is actually possible to perform such an operation using foldByKey or reduceByKey it makes semantics of the code unclear and to cite @tim-peters "There should be one-- and preferably only one --obvious way to do it" [1].

Difference between aggregateByKey and combineByKey is pretty much the same as between reduceByKey and foldByKey so for a list it is mostly a matter of taste:

def merge_value(acc, x):
    acc.append(x)
    return acc

def merge_combiners(acc1, acc2):
    acc1.extend(acc2)
    return acc1

rdd = (sc.parallelize([("a", 7), ("b", 3), ("a", 8)])
   .combineByKey(
       lambda x: [x],
       lambda u, v: u + [v],
       lambda u1,u2: u1+u2))

In practice you should prefer groupByKey though. PySpark implementation is significantly more optimized compared to naive implementation like the one provided above.

1.Peters, T. PEP 20 -- The Zen of Python. (2004). at https://www.python.org/dev/peps/pep-0020/


* In practice there is actually quite a lot to loose here, especially when using PySpark. Python implementation of groupByKey is significantly more optimized than naive combine by key. You can check Be Smart About groupByKey, created by me and @eliasah for an additional discussion.