Namenode failure and recovery in Hadoop

Nagendra kumar picture Nagendra kumar · Nov 21, 2013 · Viewed 11k times · Source

How does Hadoop determine that the Namenode has failed or is not working?

I know that in Hadoop the Namenode is the main point which keeps all the metadata, recognizes the failure of datanodes by heartbeats, and opts for replication of data in case of datanode failure.

If the Namenode fails, which system recognizes the failure, and what is the recovery process?

Answer

Charles Menguy picture Charles Menguy · Nov 21, 2013

It depends which version of Hadoop you are talking about. Before Hadoop 2, the Namenode was a single point of failure, so if it failed that meant your cluster became unusable. Even the SecondaryNameNode doesn't help in that case since it's only used for checkpoints, not as a backup for the NameNode. When the NameNode fails, someone like an administrator would have to manually restart the NameNode.

But since Hadoop 2, you have a better way to handle failures in the NameNode. You can run 2 redundant NameNodes alongside one another, so that if one of the Namenodes fails, the cluster will quickly failover to the other NameNode.

The way it works is pretty transparent, basically the DataNodes will send reports to both NameNodes so that if one fails, the other one will be ready to be used in active mode. And for the client, it simply contacts every NameNode configured until it finds the active one. So if it gets a reply saying to try elsewhere, or if the NameNode doesn't reply, it knows that it needs to use a different NameNode.

Here is a schema taken from the Cloudera blog which explains that in more details:

HANN

You can also take a look at the HA article on the official documentation on how to set this up.