Hadoop NameNode: single point of failure

rakeshr · Dec 21, 2010 · Viewed 14.9k times

The NameNode in the Hadoop architecture is a single point of failure.

How do people with large Hadoop clusters cope with this problem?

Is there an industry-accepted solution that has worked well, in which a secondary NameNode takes over if the primary one fails?

Answer

Bkkbrad · Dec 21, 2010

Yahoo publishes recommended configuration settings for different cluster sizes that take NameNode failure into account. For example:

The single point of failure in a Hadoop cluster is the NameNode. While the loss of any other machine (intermittently or permanently) does not result in data loss, NameNode loss results in cluster unavailability. The permanent loss of NameNode data would render the cluster's HDFS inoperable.

Therefore, another step should be taken in this configuration to back up the NameNode metadata.
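
In practice, for the Hadoop 0.20/1.x releases current at the time, the usual way to do this is to list several directories in dfs.name.dir, one of them on a remote NFS mount, so the NameNode writes its fsimage and edit log to each of them. A minimal sketch of such an hdfs-site.xml entry, with example paths:

    <property>
      <name>dfs.name.dir</name>
      <!-- The NameNode writes its metadata (fsimage and edit log) to
           every directory listed here; the NFS path keeps an
           off-machine copy. Paths are examples only. -->
      <value>/local/hadoop/name,/mnt/nfs/hadoop/name</value>
    </property>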

Facebook uses a tweaked version of Hadoop for its data warehouses that includes optimizations focused on NameNode reliability. In addition to the patches available on GitHub, Facebook appears to use AvatarNode specifically for quickly switching between primary and secondary NameNodes. Dhruba Borthakur's blog contains several other entries offering further insight into the NameNode as a single point of failure.
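
To make "quickly switching" concrete, here is a hypothetical sketch (not Facebook's actual code; the NameNodeSelector class, hostnames, and timeout are invented for illustration) of the client-side half of the idea: probe the primary NameNode's RPC address (8020 is the default port) and fall back to the promoted standby if the primary is unreachable. The real AvatarNode also keeps the standby's namespace hot by replaying the primary's edit log from shared storage, which this sketch does not show:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    /**
     * Hypothetical illustration of primary/standby NameNode selection.
     * A thin client-side layer probes the primary's RPC address and
     * falls back to the standby if the primary is down.
     */
    public class NameNodeSelector {
        private final InetSocketAddress primary;
        private final InetSocketAddress standby;
        private final int timeoutMs;

        public NameNodeSelector(InetSocketAddress primary,
                                InetSocketAddress standby,
                                int timeoutMs) {
            this.primary = primary;
            this.standby = standby;
            this.timeoutMs = timeoutMs;
        }

        /** Returns the first address that accepts a TCP connection. */
        public InetSocketAddress activeNameNode() throws IOException {
            if (reachable(primary)) {
                return primary;
            }
            if (reachable(standby)) {
                return standby; // failover: standby has been promoted
            }
            throw new IOException("no NameNode reachable");
        }

        private boolean reachable(InetSocketAddress addr) {
            try (Socket s = new Socket()) {
                s.connect(addr, timeoutMs);
                return true;
            } catch (IOException e) {
                return false;
            }
        }

        public static void main(String[] args) throws IOException {
            NameNodeSelector selector = new NameNodeSelector(
                new InetSocketAddress("nn-primary.example.com", 8020),
                new InetSocketAddress("nn-standby.example.com", 8020),
                2000 /* connect timeout, ms */);
            System.out.println("Active NameNode: " + selector.activeNameNode());
        }
    }

(In a real deployment the active address would typically be resolved via a virtual IP or a coordination service rather than probed on every call.)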

Edit: Further info about Facebook's improvements to the NameNode.