Hadoop, Hive, Pig, HBase, Cassandra - when to use what?

Daniel · Jan 29, 2014 · Viewed 10.8k times

I am relatively new to Big Data and the Hadoop world, and I have just started experimenting with the Hortonworks Sandbox (Pig and Hive so far). In which cases should I use each of the tools mentioned above: Hadoop, Hive, Pig, HBase and Cassandra?

In my sandbox environment, with a file of just 9 MB, Hive and Pig had response times of seconds to minutes. That is obviously unusable in some situations, for example web applications (unless the cause is something else, such as my virtual machine setup).

My guesses about the correct usages are:

  • Hadoop: Just the technological base for the rest, with very few use cases where it would be used directly
  • Hive or Pig: For analytical processes that run once per hour or day
  • HBase or Cassandra: For real-time applications (e.g. web applications) where response times of 100 ms or less are required

Additionally, when should I use HBase as opposed to Cassandra?

Thanks!

Answer

Chaos · Jan 29, 2014

Your guesses are somewhat accurate.

By Hadoop, I guess you are referring to MapReduce? Hadoop as such is an ecosystem made up of many components (including MapReduce, HDFS, Pig and Hive).

MapReduce is good when you need to write the data-processing logic yourself, at the level of the Map() and Reduce() methods. In my work, I find MapReduce very useful when I'm dealing with data that is unstructured and needs to be cleansed.
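To make the Map()/Reduce() split a bit more concrete, here is a minimal sketch in plain Python. It only simulates the map, shuffle and reduce phases in memory and does not use the Hadoop API; the function names (map_record, shuffle, reduce_key, run_job) and the cleansing rule (lowercase, strip punctuation, drop non-alphabetic tokens) are assumptions made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_record(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: cleanse one raw line and emit (word, 1) pairs."""
    for token in line.lower().split():
        word = token.strip(".,;:!?\"'()")
        # Hypothetical cleansing rule: keep purely alphabetic tokens only.
        if word.isalpha():
            yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key (a real framework does this for you)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_key(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: combine all values seen for one key."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce over an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_record(line))
    return dict(reduce_key(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(["Hello, hello WORLD!", "world 123 ??"]))
```

In a real MapReduce job you would write only the map and reduce logic; the framework handles the grouping and distributes the work across the cluster.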

Hive, Pig: They are good for batch processes that run periodically (say, every few hours or days).

HBase & Cassandra: They support low-latency calls, so they can be used for real-time applications where response time is key. Have a look at this discussion to get a better idea about HBase vs Cassandra.
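As a rough analogy for the batch versus low-latency split, the sketch below contrasts scanning every row with fetching a single row by key. It is a toy in-memory example, not an HBase or Cassandra client; the row layout (user_id, last_page) and the function names are hypothetical, and real systems add job startup, network and disk costs that are not modelled here.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[str, str]  # hypothetical layout: (user_id, last_page)


def batch_scan(rows: List[Row], user_id: str) -> Optional[str]:
    """Batch style: read every row, so the cost grows with the size of the data."""
    result = None
    for uid, page in rows:
        if uid == user_id:
            result = page  # keep the latest matching row
    return result


def build_index(rows: List[Row]) -> Dict[str, str]:
    """Key-value style: organise the data by key once, up front."""
    return {uid: page for uid, page in rows}


def keyed_get(index: Dict[str, str], user_id: str) -> Optional[str]:
    """Key-value style: fetch one row directly by its key."""
    return index.get(user_id)


if __name__ == "__main__":
    rows = [("u1", "/home"), ("u2", "/cart"), ("u1", "/checkout")]
    index = build_index(rows)
    print(batch_scan(rows, "u1"), keyed_get(index, "u1"))
```

Both calls return the same answer; the difference is that the keyed lookup does not have to touch the rest of the data, which is the access pattern a web application typically needs.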