Large scale data processing Hbase vs Cassandra

Gary Lindahl  picture Gary Lindahl · Aug 30, 2011 · Viewed 36.6k times · Source

I am nearly landed at Cassandra after my research on large scale data storage solutions. But its generally said that Hbase is better solution for large scale data processing and analysis.

While both are same key/value storage and both are/can run (Cassandra recently) Hadoop layer then what makes Hadoop a better candidate when processing/analysis is required on large data.

I also found good details about both at http://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved/

but I'm still looking for concrete advantages of Hbase.

While I am more convinced about Cassandra because its simplicity for adding nodes and seamless replication and no point of failure features. And it also keeps secondary index feature so its a good plus.

Answer

jbellis picture jbellis · Aug 30, 2011

As a Cassandra developer, I'm better at answering the other side of the question:

  • Cassandra scales better. Cassandra is known to scale to over 400 nodes in a cluster; when Facebook deployed Messaging on top of HBase they had to shard it across 100-node HBase sub-clusters.
  • Cassandra supports hundreds, even thousands of ColumnFamilies. "HBase currently does not do well with anything above two or three column families."
  • As a fully distributed system with no "special" nodes or processes, Cassandra is simpler to set up and operate, easier to troubleshoot, and more robust.
  • Cassandra's support for multi-master replication means that not only do you get the obvious power of multiple datacenters -- geographic redundancy, local latencies -- but you can also split realtime and analytical workloads into separate groups, with realtime, bidirectional replication between them. If you don't split those workloads apart they will contend spectacularly.
  • Because each Cassandra node manages its own local storage, Cassandra has a substantial performance advantage that is unlikely to be narrowed significantly. (E.g., it's standard practice to put the Cassandra commitlog on a separate device so it can do its sequential writes unimpeded by random i/o from read requests.)
  • Cassandra allows you to choose how strong you want it to require consistency to be on a per-operation basis. Sometimes this is misunderstood as "Cassandra does not give you strong consistency," but that is incorrect.
  • Cassandra offers RandomPartitioner as well as the more Bigtable-like OrderedPartitioner. RandomPartitioner is much less prone to hot spots.
  • Cassandra offers on- or off-heap caching with performance comparable to memcached, but without the cache consistency problems or complexity of requiring extra moving parts
  • Non-Java clients are not second-class citizens

To my knowledge, the main advantage HBase has right now (HBase 0.90.4 and Cassandra 0.8.4) is that Cassandra does not yet support transparent data compression. (This has been added for Cassandra 1.0, due in early October, but today that is a real advantage for HBase.) HBase may also be better optimized for the kinds of range scans done by Hadoop batch processing.

There are also some things that are not necessarily better, or worse, just different. HBase adheres more strictly to the Bigtable data model, where each column is versioned implicitly. Cassandra drops versioning, and adds SuperColumns instead.

Hope that helps!