I was reading about Hadoop and how fault tolerant it is. I read about HDFS and how failures of its master and slave nodes are handled. However, I couldn't find any document that explains how MapReduce handles fault tolerance. In particular, what happens when the master node running the JobTracker goes down, or when any of the slave nodes goes down?
Could anyone point me to links or references that explain this in detail?
Fault tolerance of the MapReduce layer depends on the Hadoop version. In versions before Hadoop 0.21, no checkpointing was done, and a JobTracker failure would lead to loss of data.
Starting with Hadoop 0.21, however, checkpointing was added: the JobTracker records its progress in a file. When a JobTracker starts up, it looks for such data so that it can restart work from where it left off.
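
To make the checkpoint-and-restart idea concrete, here is a minimal sketch in Python. It is not Hadoop code and does not reflect the JobTracker's real file format or API: the `ToyTracker` class, the `progress.json` file, the task names, and the `fail_after` crash switch are all made up for illustration. It only shows the general pattern of recording progress in a file and reading it back on startup.

```python
import json
import os
import tempfile


class ToyTracker:
    """Hypothetical stand-in for a job tracker; NOT Hadoop's actual API."""

    def __init__(self, checkpoint_path):
        self.checkpoint_path = checkpoint_path
        # On startup, look for progress recorded by a previous run.
        self.completed = self._load_checkpoint()

    def _load_checkpoint(self):
        if not os.path.exists(self.checkpoint_path):
            return set()  # no earlier progress: start from scratch
        with open(self.checkpoint_path) as f:
            return set(json.load(f))

    def _save_checkpoint(self):
        # Write to a temporary file and rename it, so a crash in the
        # middle of a write does not leave a half-written checkpoint.
        tmp = self.checkpoint_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(sorted(self.completed), f)
        os.replace(tmp, self.checkpoint_path)

    def run(self, tasks, fail_after=None):
        """Run every task not yet completed and return the ones run now.

        fail_after (an assumption for this demo) simulates a crash once
        that many tasks have been executed in this call.
        """
        executed = []
        for task in tasks:
            if task in self.completed:
                continue  # already done before the restart
            if fail_after is not None and len(executed) >= fail_after:
                raise RuntimeError("simulated tracker crash")
            executed.append(task)
            self.completed.add(task)
            self._save_checkpoint()
        return executed


def demo():
    tasks = ["map-0", "map-1", "map-2", "reduce-0"]
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "progress.json")
        first = ToyTracker(path)
        try:
            first.run(tasks, fail_after=2)  # crashes after two tasks
        except RuntimeError:
            pass
        # A fresh tracker reads the checkpoint and skips finished tasks.
        second = ToyTracker(path)
        return second.run(tasks)


if __name__ == "__main__":
    print(demo())
```

Without the checkpoint file, the second tracker would have no record of the first two tasks and would have to run all four again, which is roughly the pre-0.21 situation described above.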