Why do Scala and frameworks like Spark and Scalding have both reduce
and foldLeft
? So then what's the difference between reduce
and fold
?
A big big difference, not mentioned in any other stackoverflow answer relating to this topic clearly, is that reduce
should be given a commutative monoid, i.e. an operation that is both commutative and associative. This means the operation can be parallelized.
This distinction is very important for Big Data / MPP / distributed computing, and the entire reason why reduce
even exists. The collection can be chopped up and the reduce
can operate on each chunk, then the reduce
can operate on the results of each chunk - in fact the level of chunking need not stop one level deep. We could chop up each chunk too. This is why summing integers in a list is O(log N) if given an infinite number of CPUs.
If you just look at the signatures there is no reason for reduce
to exist because you can achieve everything you can with reduce
with a foldLeft
. The functionality of foldLeft
is a greater than the functionality of reduce
.
But you cannot parallelize a foldLeft
, so its runtime is always O(N) (even if you feed in a commutative monoid). This is because it's assumed the operation is not a commutative monoid and so the cumulated value will be computed by a series of sequential aggregations.
foldLeft
does not assume commutativity nor associativity. It's associativity that gives the ability to chop up the collection, and it's commutativity that makes cumulating easy because order is not important (so it doesn't matter which order to aggregate each of the results from each of the chunks). Strictly speaking commutativity is not necessary for parallelization, for example distributed sorting algorithms, it just makes the logic easier because you don't need to give your chunks an ordering.
If you have a look at the Spark documentation for reduce
it specifically says "... commutative and associative binary operator"
http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.rdd.RDD
Here is proof that reduce
is NOT just a special case of foldLeft
scala> val intParList: ParSeq[Int] = (1 to 100000).map(_ => scala.util.Random.nextInt()).par
scala> timeMany(1000, intParList.reduce(_ + _))
Took 462.395867 milli seconds
scala> timeMany(1000, intParList.foldLeft(0)(_ + _))
Took 2589.363031 milli seconds
Now this is where it gets a little closer to the FP / mathematical roots, and a little trickier to explain. Reduce is defined formally as part of the MapReduce paradigm, which deals with orderless collections (multisets), Fold is formally defined in terms of recursion (see catamorphism) and thus assumes a structure / sequence to the collections.
There is no fold
method in Scalding because under the (strict) Map Reduce programming model we cannot define fold
because chunks do not have an ordering and fold
only requires associativity, not commutativity.
Put simply, reduce
works without an order of cumulation, fold
requires an order of cumulation and it is that order of cumulation that necessitates a zero value NOT the existence of the zero value that distinguishes them. Strictly speaking reduce
should work on an empty collection, because its zero value can by deduced by taking an arbitrary value x
and then solving x op y = x
, but that doesn't work with a non-commutative operation as there can exist a left and right zero value that are distinct (i.e. x op y != y op x
). Of course Scala doesn't bother to work out what this zero value is as that would require doing some mathematics (which are probably uncomputable), so just throws an exception.
It seems (as is often the case in etymology) that this original mathematical meaning has been lost, since the only obvious difference in programming is the signature. The result is that reduce
has become a synonym for fold
, rather than preserving it's original meaning from MapReduce. Now these terms are often used interchangeably and behave the same in most implementations (ignoring empty collections). Weirdness is exacerbated by peculiarities, like in Spark, that we shall now address.
So Spark does have a fold
, but the order in which sub results (one for each partition) are combined (at the time of writing) is the same order in which tasks are completed - and thus non-deterministic. Thanks to @CafeFeed for pointing out that fold
uses runJob
, which after reading through the code I realised that it's non-deterministic. Further confusion is created by Spark having a treeReduce
but no treeFold
.
There is a difference between reduce
and fold
even when applied to non-empty sequences. The former is defined as part of the MapReduce programming paradigm on collections with arbitrary order (http://theory.stanford.edu/~sergei/papers/soda10-mrc.pdf) and one ought to assume operators are commutative in addition to being associative to give deterministic results. The latter is defined in terms of catomorphisms and requires that the collections have a notion of sequence (or are defined recursively, like linked lists), thus do not require commutative operators.
In practice due to the unmathematical nature of programming, reduce
and fold
tend to behave in the same way, either correctly (like in Scala) or incorrectly (like in Spark).
My opinion is that confusion would be avoided if use of the term fold
was completely dropped in Spark. At least spark does have a note in their documentation:
This behaves somewhat differently from fold operations implemented for non-distributed collections in functional languages like Scala.