Spark Scala: Understanding reduceByKey(_ + _)

Elsayed · May 1, 2016 · Viewed 10.2k times

I can't understand reduceByKey(_ + _) in this first example of Spark with Scala:

import org.apache.spark.SparkContext

object WordCount {
  def main(args: Array[String]): Unit = {
    val inputPath = args(0)
    val outputPath = args(1)
    val sc = new SparkContext()
    val lines = sc.textFile(inputPath)
    val wordCounts = lines.flatMap { line => line.split(" ") }
      .map(word => (word, 1))
      .reduceByKey(_ + _) // I can't understand this line
    wordCounts.saveAsTextFile(outputPath)
  }
}

Answer

Sleiman Jneidi · May 1, 2016

reduce takes two elements and produces a third by applying a function to the two parameters.

The code you've shown is equivalent to the following:

 reduceByKey((x, y) => x + y)
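For instance, applied to a small pair RDD (a minimal sketch, assuming a local SparkContext named sc is already available), values that share the same key are combined pairwise with that function:

 // Assumes sc is an existing SparkContext, e.g. from a local Spark shell
 val pairs = sc.parallelize(Seq(("spark", 1), ("scala", 1), ("spark", 1)))
 // For each key, Spark repeatedly applies the function to pairs of values
 val counts = pairs.reduceByKey((x, y) => x + y)
 counts.collect().foreach(println) // prints (scala,1) and (spark,2)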

Instead of defining dummy variables and writing a lambda, Scala is smart enough to figure out that what you're trying to achieve is applying a function (sum in this case) to the two parameters it receives, hence the syntax

 reduceByKey(_ + _)
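The underscore placeholder is plain Scala rather than anything Spark-specific; the same shorthand works on ordinary collections (a small illustration, no Spark required):

 val nums = List(1, 2, 3, 4)
 nums.reduce((x, y) => x + y) // 10, with an explicit lambda
 nums.reduce(_ + _)           // 10, each underscore stands for one parameter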