Why do we need ZooKeeper in the Hadoop stack?

user1099871 picture user1099871 · May 24, 2012 · Viewed 35.5k times · Source

I am new to Hadoop/ZooKeeper. I cannot understand the purpose of using ZooKeeper with Hadoop, is ZooKeeper writing data in Hadoop? If not, then why we do we use ZooKeeper with Hadoop?

Answer

Arnon Rotem-Gal-Oz picture Arnon Rotem-Gal-Oz · May 24, 2012

Hadoop 1.x does not use Zookeeper. HBase does use zookeeper even in Hadoop 1.x installations.

Hadoop adopted Zookeeper as well starting with version 2.0.

The purpose of Zookeeper is cluster management. This fits with the general philosophy of *nix of using smaller specialized components - so components of Hadoop that want clustering capabilities rely on Zookeeper for that rather than develop their own.

Zookeeper is a distributed storage that provides the following guarantees (copied from Zookeeper overview page):

  • Sequential Consistency - Updates from a client will be applied in the order that they were sent.
  • Atomicity - Updates either succeed or fail. No partial results.
  • Single System Image - A client will see the same view of the service regardless of the server that it connects to.
  • Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
  • Timeliness - The clients view of the system is guaranteed to be up-to-date within a certain time bound.

You can use these to implement different "recipes" that are required for cluster management like locks, leader election etc.

If you're going to use ZooKeeper yourself, I recommend you take a look at Curator from Netflix which makes it easier to use (e.g. they implement a few recipes out of the box)