Python MapReduce Hadoop Streaming Job that requires multiple input files?

ComputerFellow picture ComputerFellow · Dec 27, 2013 · Viewed 14.4k times · Source

I have two files in my cluster File A and File B with the following data -

File A

#Format: 
#Food Item | Is_A_Fruit (BOOL)

Orange | Yes
Pineapple | Yes
Cucumber | No
Carrot | No
Mango | Yes

File B

#Format:
#Food Item | Vendor Name

Orange | Vendor A
Pineapple | Vendor B
Cucumber | Vendor B
Carrot | Vendor B
Mango | Vendor A

Basically I want to find out How many fruits are each vendor selling?

Expected output:

Vendor A | 2
Vendor B | 1

I need to do this using hadoop streaming python map reduce.

I have read how to do a basic word count, I read from sys.stdin and emit k,v pairs for the reducer to then reduce.

How do I approach this problem?

My main concern is how to read from multiple files and then compare them in Hadoop Streaming.

I can do this in normal python (i.e without MapReduce & Hadoop, it's straightforward.) but it's infeasible for the sheer size of data that I have with me.

Answer

cabad picture cabad · Dec 27, 2013

Is File A really that large? I would put it in the DistributedCache and read it from there. To put it in the distributed cache, use this option in the Hadoop streaming call:

-cacheFile 'hdfs://namenode:port/the/hdfs/path/to/FileA#FileA'

(I suppose the following should work too, but I have not tried it:)

-cacheFile '/the/hdfs/path/to/FileA#FileA'

Note that the #fileA is the name you are using to make the file available to your mappers.

Then, in your mapper, you'll read FileB from sys.stdin (asuming you called Hadoop Streaming using -input '/user/foo/FileB') AND, to read FileA, you should do something like this:

f = open('FileA', 'r')
...
f.readline()

Now, I suppose you have already thought of this, but to me it would make sense to have a mapper like this:

  1. Open FileA
  2. Read FileA, line by line (in a loop) and load it into a map so that you can easily lookup a key and find its value (yes, no).
  3. Have your main loop reading from stdin. Within the loop, for each line (in FileB), check your map (see step 2) to find out whether you have a fruit or not... etc.