Consider a MySQL products database with 10 million products for an e-commerce website.
I'm trying to set up a classification module to categorize products. I'm using Apache Sqoop to import data from MySQL to Hadoop.
I wanted to use Mahout on top of it as a machine learning framework and use one of its classification algorithms, and then I ran into Spark, which ships with MLlib.
The main difference comes from the underlying frameworks: in the case of Mahout it is Hadoop MapReduce, and in the case of MLlib it is Spark. More specifically, the difference is in per-job overhead.
If your ML algorithm maps to a single MR job, the main difference is only the startup overhead, which is dozens of seconds for Hadoop MR and, say, about 1 second for Spark. For a one-off model training run that is not very important.
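For the single-job case, here is a minimal sketch using MLlib's Naive Bayes, which needs essentially one pass over the data; the HDFS path and application name are assumptions, not part of your setup:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.util.MLUtils

object SinglePassTrain {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("single-pass-nb"))

    // Hypothetical HDFS path with LIBSVM-formatted product features
    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///products/features.libsvm")

    // Naive Bayes makes essentially one pass over the data, so the
    // job-startup overhead is paid only once on either framework.
    val model = NaiveBayes.train(data, lambda = 1.0)

    sc.stop()
  }
}
```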
Things are different if your algorithm maps to many jobs.
In that case we pay the same overhead on every iteration, and that can be a game changer.
Let's assume we need 100 iterations, each requiring 5 seconds of cluster CPU. Taking roughly 30 seconds as a representative MR job-startup time, that is about 100 × (5 + 30) = 3500 seconds on Hadoop MR versus about 100 × (5 + 1) = 600 seconds on Spark, so most of the MR time is pure startup overhead.
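To make the iterative case concrete, here is a minimal sketch using MLlib's SGD-based logistic regression (binary classification, used only for illustration; the HDFS path and application name are assumptions). The input is cached once and all 100 iterations run over the in-memory RDD, instead of 100 separate MapReduce jobs each re-reading the data:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.util.MLUtils

object IterativeTrain {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("iterative-lr"))

    // Hypothetical HDFS path with LIBSVM-formatted product features;
    // cache() keeps the RDD in memory across iterations.
    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///products/features.libsvm").cache()

    // 100 gradient-descent iterations over the cached data: the input is
    // read from HDFS once instead of being re-read on every iteration.
    val model = LogisticRegressionWithSGD.train(data, numIterations = 100)

    sc.stop()
  }
}
```

The `.cache()` call is the design choice that matters here: without it, each iteration would go back to HDFS for the input, which is essentially the cost profile of chained MapReduce jobs.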
At the same time, Hadoop MR is a much more mature framework than Spark, so if you have a lot of data and stability is paramount, I would consider Mahout a serious alternative.