Chaining multiple MapReduce jobs in Hadoop

Niels Basjes · Mar 23, 2010 · Viewed 85.2k times

In many real-life situations where you apply MapReduce, the final algorithms end up being several MapReduce steps.

i.e. Map1, Reduce1, Map2, Reduce2, and so on.

So the output of the last reduce is needed as the input for the next map.

The intermediate data is something you (in general) do not want to keep once the pipeline has completed successfully. Also, because this intermediate data is generally some data structure (like a 'map' or a 'set'), you don't want to put too much effort into writing and reading these key-value pairs.

What is the recommended way of doing that in Hadoop?

Is there a (simple) example that shows how to handle this intermediate data in the correct way, including the cleanup afterward?

Answer

Binary Nerd · Mar 24, 2010

I think this tutorial on Yahoo's developer network will help you with this: Chaining Jobs

You use JobClient.runJob(). The output path of the data from the first job becomes the input path of your second job. These paths need to be passed in as arguments to your jobs, with appropriate code to parse them and set up the parameters for each job.
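For example, a chained driver with the old mapred API could look roughly like this (a minimal sketch; FirstMapper, FirstReducer, SecondMapper, SecondReducer and the argument layout are placeholders, not code from the question):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ChainedJobs {
    public static void main(String[] args) throws Exception {
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);   // output of job 1, input of job 2
        Path output = new Path(args[2]);

        // Job 1: Map1 + Reduce1
        JobConf job1 = new JobConf(ChainedJobs.class);
        job1.setJobName("job-1");
        job1.setMapperClass(FirstMapper.class);
        job1.setReducerClass(FirstReducer.class);
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job1, input);
        FileOutputFormat.setOutputPath(job1, intermediate);
        JobClient.runJob(job1);                  // blocks until job 1 has finished

        // Job 2: Map2 + Reduce2, reading the intermediate data written by job 1
        JobConf job2 = new JobConf(ChainedJobs.class);
        job2.setJobName("job-2");
        job2.setMapperClass(SecondMapper.class);
        job2.setReducerClass(SecondReducer.class);
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job2, intermediate);
        FileOutputFormat.setOutputPath(job2, output);
        JobClient.runJob(job2);
    }
}
```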

The above method may, however, be the way the older mapred API did it, but it should still work. There is a similar method in the newer mapreduce API, but I'm not sure what it is.
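For reference, the newer org.apache.hadoop.mapreduce API can do the same thing with Job and waitForCompletion(). A rough sketch (Job.getInstance assumes a newer Hadoop version, and the mapper/reducer classes are again placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedJobsNewApi {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);
        Path output = new Path(args[2]);

        Job job1 = Job.getInstance(conf, "job-1");
        job1.setJarByClass(ChainedJobsNewApi.class);
        job1.setMapperClass(FirstMapper.class);
        job1.setReducerClass(FirstReducer.class);
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job1, input);
        FileOutputFormat.setOutputPath(job1, intermediate);
        if (!job1.waitForCompletion(true)) {
            System.exit(1);                      // stop the chain if job 1 fails
        }

        Job job2 = Job.getInstance(conf, "job-2");
        job2.setJarByClass(ChainedJobsNewApi.class);
        job2.setMapperClass(SecondMapper.class);
        job2.setReducerClass(SecondReducer.class);
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job2, intermediate);
        FileOutputFormat.setOutputPath(job2, output);
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}
```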

As for removing intermediate data after a job has finished, you can do this in your code. The way I've done it before is with something like:

FileSystem.delete(Path f, boolean recursive);

Where the path is the HDFS location of the data. You need to make sure that you only delete this data once no other job requires it.
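Putting that together, a small cleanup step at the end of the driver could look like this (a sketch; the intermediate path is just an example):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CleanupIntermediate {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Example HDFS location of the intermediate data; adjust to your pipeline.
        Path intermediate = new Path("/tmp/pipeline/intermediate");
        if (fs.exists(intermediate)) {
            fs.delete(intermediate, true);   // true = delete the directory recursively
        }
    }
}
```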