MapReduce or Spark?

apache-spark hadoop mapreduce

Nosk · Mar 4, 2014 · Viewed 21.3k times

I have tested Hadoop and MapReduce with Cloudera and found it pretty cool. I thought it was the most recent and relevant big data solution. But a few days ago, I found this: https://spark.incubator.apache.org/

A "Lightning fast cluster computing system", able to run on top of a Hadoop cluster, and apparently able to crush MapReduce. I saw that it works more in RAM than MapReduce does. I think MapReduce is still relevant when you have to do cluster computing to overcome the I/O problems you can have on a single machine. But since Spark can do the jobs that MapReduce does, and may be far more efficient for several operations, isn't this the end of MapReduce? Or is there something more that MapReduce can do, or can MapReduce be more efficient than Spark in certain contexts?

Answer

Depends what you want to do.

MapReduce's greatest strength is processing lots of large text files. Hadoop's implementation is built around string processing, and it's very I/O heavy.
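To make the "large text files" use case concrete, here is a minimal single-machine sketch of the map, shuffle and reduce phases in plain Python, counting tokens in log-like lines. It does not use Hadoop's API. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample lines are made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    # Emit a (token, 1) pair for every whitespace-separated token in the line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    # Group values by key, which is what the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Collapse all values for one key into a single count.
    return key, sum(values)


def word_count(lines):
    # Chain the three phases; a real cluster would run each phase across many machines.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical log lines, only for demonstration.
    sample = ["GET /index.html", "GET /about.html", "POST /index.html"]
    print(word_count(sample))
```

In a real Hadoop job, the input and the output of each phase would typically live on disk (HDFS), which is where the I/O cost comes from.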

The problem with MapReduce is that people see the easy parallelism hammer and everything starts to look like a nail. Unfortunately, Hadoop's performance for anything other than processing large text files is terrible. If you write decent parallel code, it can often finish before Hadoop even spawns its first JVM. I've seen differences of 100x in my own code.

Spark eliminates a lot of Hadoop's overheads, such as the reliance on I/O for EVERYTHING. Instead, it keeps everything in memory. Great if you have enough memory, not so great if you don't.
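A rough way to picture the difference, as a plain Python sketch rather than real Hadoop or Spark code: an iterative computation that re-reads its input from disk on every pass (roughly how chained MapReduce jobs behave) versus one that loads the data once and reuses it from memory (the idea behind Spark's caching). All names here (`load`, `step`, `iterate_from_disk`, `iterate_in_memory`) and the toy "move the estimate toward the mean" update are assumptions chosen for illustration.

```python
import os
import tempfile


def load(path):
    # One full pass over the file on disk.
    with open(path) as f:
        return [float(line) for line in f]


def step(estimate, values, rate=0.5):
    # Toy update: move the estimate part of the way toward the mean of the values.
    mean = sum(values) / len(values)
    return estimate + rate * (mean - estimate)


def iterate_from_disk(path, iterations):
    # MapReduce-like: each iteration behaves as a separate job that re-reads its input.
    estimate, reads = 0.0, 0
    for _ in range(iterations):
        values = load(path)
        reads += 1
        estimate = step(estimate, values)
    return estimate, reads


def iterate_in_memory(path, iterations):
    # Spark-like: read once, keep the dataset in memory, reuse it on every iteration.
    values = load(path)
    reads = 1
    estimate = 0.0
    for _ in range(iterations):
        estimate = step(estimate, values)
    return estimate, reads


if __name__ == "__main__":
    # Write a small hypothetical dataset to a temporary file for the demo.
    fd, path = tempfile.mkstemp(text=True)
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(str(x) for x in [1.0, 2.0, 3.0, 4.0]))
    try:
        print(iterate_from_disk(path, 10))
        print(iterate_in_memory(path, 10))
    finally:
        os.remove(path)
```

Both functions compute the same estimate. Only the number of passes over the file differs, and that gap grows with the number of iterations. The in-memory version also shows the catch: `values` has to fit in RAM.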

Remember that Spark is an extension of Hadoop, not a replacement. If you use Hadoop to process logs, Spark probably won't help. If you have more complex, perhaps tightly coupled problems, then Spark would help a lot. Also, you may like Spark's Scala interface for online computations.
