How does Hadoop determine that the Namenode has failed or is not working?
I know that in Hadoop the Namenode is the main point which keeps all the metadata, recognizes the failure of datanodes by heartbeats, and opts for replication of data in case of datanode failure.
If the Namenode fails, which system recognizes the failure, and what is the recovery process?
It depends which version of Hadoop you are talking about. Before Hadoop 2, the Namenode
was a single point of failure, so if it failed that meant your cluster became unusable. Even the SecondaryNameNode
doesn't help in that case since it's only used for checkpoints, not as a backup for the NameNode
. When the NameNode
fails, someone like an administrator would have to manually restart the NameNode
.
But since Hadoop 2, you have a better way to handle failures in the NameNode
. You can run 2 redundant NameNodes
alongside one another, so that if one of the Namenodes
fails, the cluster will quickly failover to the other NameNode
.
The way it works is pretty transparent, basically the DataNodes
will send reports to both NameNodes
so that if one fails, the other one will be ready to be used in active mode. And for the client, it simply contacts every NameNode
configured until it finds the active one. So if it gets a reply saying to try elsewhere, or if the NameNode
doesn't reply, it knows that it needs to use a different NameNode
.
Here is a schema taken from the Cloudera blog which explains that in more details:
You can also take a look at the HA article on the official documentation on how to set this up.