RDD Aggregate in Spark

Lijju Mathew · May 4, 2016

I am an Apache Spark learner and have come across an RDD action, aggregate, which I have no clue how it functions. Can someone spell out and explain in detail, step by step, how we arrive at the result below for the code here?

RDD input = {1,2,3,3}

RDD aggregate function:

rdd.aggregate((0, 0))(
  (x, y) => (x._1 + y, x._2 + 1),
  (x, y) => (x._1 + y._1, x._2 + y._2)
)

Output: (9, 4)

Thanks

Answer

zero323 · May 4, 2016

If you are not sure what is going on, it is best to follow the types. Omitting the implicit ClassTag for brevity, we start with something like this:

abstract class RDD[T] extends Serializable with Logging 

def aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U): U 

If you ignore all the additional parameters, you'll see that aggregate is a function which maps from RDD[T] to U. That means the type of the values in the input RDD doesn't have to be the same as the type of the output value. So it is clearly different from, for example, reduce:

def reduce(func: (T, T) ⇒ T): T 

or fold:

def fold(zeroValue: T)(op: (T, T) => T): T

Like fold, aggregate requires a zeroValue. How do you choose it? It should be an identity (neutral) element with respect to combOp.
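A quick way to check a candidate zeroValue is to confirm that combining it with any partial result via combOp leaves that result unchanged. Using the combOp from the question, written here as a named function purely for illustration, (0, 0) passes that check:

val combOp = (x: (Int, Int), y: (Int, Int)) => (x._1 + y._1, x._2 + y._2)

combOp((0, 0), (9, 4))  // (9, 4) -- neutral on the left
combOp((9, 4), (0, 0))  // (9, 4) -- neutral on the right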

You also have to provide two functions:

  • seqOp which maps from (U, T) to U
  • combOp which maps from (U, U) to U

Just based on these signatures, you should already see that only seqOp may access the raw data. It takes some value of type U, another one of type T, and returns a value of type U. In your case it is a function with the following signature:

((Int, Int), Int) => (Int, Int) 

At this point you probably suspect it is used for some kind of fold-like operation.
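To make that concrete, here is the seqOp from the question written as a named function (the name is just for illustration), applied to single elements; it accumulates a running (sum, count):

val seqOp = (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1)

seqOp((0, 0), 1)  // (1, 1) -- (sum, count) after seeing 1
seqOp((1, 1), 2)  // (3, 2) -- (sum, count) after seeing 1 and 2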

The second function takes two arguments of type U and returns a value of type U. As stated before, it should be clear that it doesn't touch the original data and operates only on the values already processed by seqOp. In your case this function has the following signature:

((Int, Int), (Int, Int)) => (Int, Int) 

So how does all of that fit together?

  1. First, each partition is aggregated using the standard Iterator.aggregate with zeroValue, seqOp and combOp passed as z, seqop and combop respectively. Since the InterruptibleIterator used internally doesn't override aggregate, it should be executed as a simple foldLeft(zeroValue)(seqOp).

  2. Next, the partial results collected from each partition are aggregated using combOp.

Let's assume that the input RDD has three partitions with the following distribution of values:

  • Iterator(1, 2)
  • Iterator(3, 3)
  • Iterator()

You can expect that execution, ignoring absolute order, will be equivalent to something like this:

val seqOp = (x: (Int, Int), y: Int) => (x._1 + y, x._2 + 1)
val combOp = (x: (Int, Int), y: (Int, Int)) => (x._1 + y._1, x._2 + y._2)

Seq(Iterator(1, 2), Iterator(3, 3), Iterator())
  .map(_.foldLeft((0, 0))(seqOp))
  .reduce(combOp)

foldLeft for a single partition can look like this:

Iterator(1, 2).foldLeft((0, 0))(seqOp)  // start from the zero value
Iterator(2).foldLeft((1, 1))(seqOp)     // after consuming 1: seqOp((0, 0), 1) == (1, 1)
(3, 2)                                  // after consuming 2: seqOp((1, 1), 2) == (3, 2)

and over all partitions

Seq((3,2), (6,2), (0,0))

which, when combined, will give you the observed result:

(3 + 6 + 0, 2 + 2 + 0)
(9, 4)
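If you want to reproduce the whole thing end to end, a minimal sketch, assuming a live SparkContext named sc (for example in spark-shell), could look like this:

// Split the input across three partitions; the result does not depend on the exact split.
val rdd = sc.parallelize(Seq(1, 2, 3, 3), numSlices = 3)

val result = rdd.aggregate((0, 0))(
  (x, y) => (x._1 + y, x._2 + 1),       // seqOp: per-partition (sum, count)
  (x, y) => (x._1 + y._1, x._2 + y._2)  // combOp: merge partial (sum, count) pairs
)
// result: (Int, Int) = (9, 4)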

In general this is a common pattern you will find all over Spark: you pass a neutral value, a function used to process values per partition, and a function used to merge partial aggregates from different partitions. Some other examples include:

  • aggregateByKey
  • User Defined Aggregate Functions
  • Aggregators on Spark Datasets.
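For example, aggregateByKey follows exactly the same shape, with the zeroValue/seqOp/combOp triple applied per key. A minimal sketch, again assuming a SparkContext named sc, computing a per-key (sum, count):

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3), ("b", 3)))

val sumCountByKey = pairs.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),  // seqOp within a partition, per key
  (a, b) => (a._1 + b._1, a._2 + b._2)   // combOp across partitions, per key
)

sumCountByKey.collect()  // e.g. Array((a,(3,2)), (b,(6,2))), order may vary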