Chaining multiple mapreduce tasks in Hadoop streaming

Varadharajan Mukundan picture Varadharajan Mukundan · Jan 7, 2011 · Viewed 7.3k times · Source

I am in scenario where I have two mapreduce jobs. I am more comfortable with python and planning to use it for writing mapreduce scripts and use hadoop streaming for the same. is there a convenient to chain both the jobs following form when hadoop streaming is used?

Map1 -> Reduce1 -> Map2 -> Reduce2

I've heard a lot of methods to accomplish this in java, But i need something for Hadoop streaming.

Answer

cwensel picture cwensel · Jan 7, 2011

Here is a great blog post on how to use Cascading and Streaming. http://www.xcombinator.com/2009/11/18/how-to-use-cascading-with-hadoop-streaming/

The value here is you can mix java (Cascading query flows) with your custom streaming operations in the same app. I find this much less brittle than other methods.

Note, the Cascade object in Cascading allows you to chain multiple Flows (via the above blog post your Streaming job would become a MapReduceFlow).

Disclaimer: I'm the author of Cascading