MySQL Cluster vs. Hadoop for handling big data

Tobi Weißhaar · Jan 29, 2014

I want to know the advantages and disadvantages of using a MySQL Cluster versus the Hadoop framework. Which is the better solution? I would like to read your opinion.

I think the advantages of using a MySQL Cluster are:

  1. high availability
  2. good scalability
  3. high performance / real time data access
  4. you can use commodity hardware

And I don't see a disadvantage! Are there any disadvantages that Hadoop does not have?

The advantages of Hadoop with Hive on top of it are:

  1. also good scalability
  2. you can also use commodity hardware
  3. the ability to run in heterogeneous environments
  4. parallel computing with the MapReduce framework
  5. Hive with HiveQL

and the disadvantage is:

  1. no real time data access. It may take minutes or hours to analyze the data.

So in my opinion, a MySQL Cluster is the better solution for handling big data. Why is Hadoop considered the holy grail of handling big data? What is your opinion?

Answer

Ross · May 9, 2015

Both of the above answers miss a huge difference between MySQL and Hadoop. MySQL requires you to store data in a fixed format. It likes heavily structured data - you declare the data type of each column in a table, and so on. Hadoop doesn't care about this at all.

Example - if you have a billion text log files, then to make analysis even possible in MySQL you'd first need to parse the data and load it into a MySQL table, typing each column along the way. With Hadoop and MapReduce, you define the function that scans/analyzes/returns the data from its raw source - you don't need a pre-processing ETL step to get it pre-structured.
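To make that concrete, here is a minimal sketch of the map/reduce idea in plain Python. It is not Hadoop itself: the map, shuffle and reduce phases are simulated in one process. The log format and the names `RAW_LOGS`, `map_line`, `reduce_counts` and `run_job` are assumptions made up for illustration. The point is only that the parsing logic lives inside the map function and runs against the raw lines, so no table or column types are declared up front.

```python
from collections import defaultdict

# Hypothetical raw log lines; the "date time LEVEL message" layout is an assumption.
RAW_LOGS = [
    "2014-01-29 10:00:01 ERROR disk full",
    "2014-01-29 10:00:02 INFO job started",
    "malformed line",
    "2014-01-29 10:00:05 ERROR timeout",
]


def map_line(line):
    """Map step: read one raw line and emit (log_level, 1) if it parses."""
    parts = line.split(maxsplit=3)
    if len(parts) >= 3:
        # The third whitespace-separated field is taken to be the log level.
        yield parts[2], 1
    # Lines that do not fit the assumed layout are simply skipped.


def reduce_counts(key, values):
    """Reduce step: sum all the counts emitted for one key."""
    return key, sum(values)


def run_job(lines):
    """Simulate map -> shuffle (group by key) -> reduce in a single process."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            groups[key].append(value)
    return dict(reduce_counts(key, values) for key, values in groups.items())


counts = run_job(RAW_LOGS)
print(counts)  # {'ERROR': 2, 'INFO': 1}
```

In a real Hadoop job the framework would distribute the map calls across the cluster and handle the grouping between map and reduce. The shape of the user-supplied functions is the part this sketch tries to show.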

If the data is already structured and in MySQL - and (hopefully) it's well structured - why export it for Hadoop to analyze? If it isn't, why spend the time to ETL the data?