I have two files in my cluster, File A and File B, with the following data:
File A
#Format:
#Food Item | Is_A_Fruit (BOOL)
Orange | Yes
Pineapple | Yes
Cucumber | No
Carrot | No
Mango | Yes
File B
#Format:
#Food Item | Vendor Name
Orange | Vendor A
Pineapple | Vendor B
Cucumber | Vendor B
Carrot | Vendor B
Mango | Vendor A
Basically, I want to find out how many fruits each vendor is selling.
Expected output:
Vendor A | 2
Vendor B | 1
I need to do this using Hadoop Streaming with a Python MapReduce job.
I have read how to do a basic word count: I read from sys.stdin and emit key-value pairs for the reducer to then reduce.
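For reference, here is the pattern I have working so far (a minimal word-count-style mapper sketch; Hadoop Streaming treats the tab as the default key-value separator):

```python
#!/usr/bin/env python
# Basic word-count mapper: read lines from stdin, emit one "word<TAB>1" pair per word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```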
How do I approach this problem?
My main concern is how to read from multiple files and then compare them in Hadoop Streaming.
I can do this in plain Python (i.e., without MapReduce and Hadoop, where it's straightforward), but that is infeasible given the sheer size of the data I have.
Is File A really that large? I would put it in the DistributedCache and read it from there. To put it in the distributed cache, use this option in the Hadoop streaming call:
-cacheFile 'hdfs://namenode:port/the/hdfs/path/to/FileA#FileA'
(I suppose the following should work too, but I have not tried it:)
-cacheFile '/the/hdfs/path/to/FileA#FileA'
Note that the #FileA part is the name you use to make the file available to your mappers.
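For context, the full Streaming invocation might look roughly like this (a sketch only: the streaming jar path, the HDFS paths, and the mapper.py/reducer.py script names are placeholders for your setup, and the scripts must be executable; on recent Hadoop versions the generic -files option can be used instead of -cacheFile):

```
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input '/user/foo/FileB' \
    -output '/user/foo/fruits_per_vendor' \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py \
    -cacheFile 'hdfs://namenode:port/the/hdfs/path/to/FileA#FileA'
```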
Then, in your mapper, you'll read FileB from sys.stdin (assuming you called Hadoop Streaming with -input '/user/foo/FileB') and, to read FileA, you should do something like this:
# Open the local copy that the distributed cache exposed under the name 'FileA'.
with open('FileA') as f:
    for line in f:
        ...  # parse each "Food Item | Is_A_Fruit" record
Now, I suppose you have already thought of this, but to me it would make sense to have a mapper like this:
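(A minimal sketch; I have not run this exact code, and it assumes the " | "-separated format shown above, with mapper.py as a hypothetical script name:)

```python
#!/usr/bin/env python
# mapper.py: join FileB (streamed on stdin) against the cached FileA.
import sys

# Load FileA from the distributed cache into a dict: food item -> is it a fruit?
is_fruit = {}
with open('FileA') as f:
    for line in f:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and the #Format header
        food, flag = [part.strip() for part in line.split('|')]
        is_fruit[food] = (flag == 'Yes')

# Stream FileB: for every fruit sold, emit "vendor<TAB>1".
for line in sys.stdin:
    line = line.strip()
    if not line or line.startswith('#'):
        continue
    food, vendor = [part.strip() for part in line.split('|')]
    if is_fruit.get(food):
        print(f"{vendor}\t1")
```

The reducer is then just the standard summing reducer from the word-count example, relying on Hadoop sorting the mapper output by key before your reducer sees it:

```python
#!/usr/bin/env python
# reducer.py: sum the 1s per vendor; input lines arrive sorted/grouped by key.
import sys

current_vendor, count = None, 0
for line in sys.stdin:
    vendor, value = line.strip().split('\t')
    if vendor != current_vendor:
        if current_vendor is not None:
            print(f"{current_vendor} | {count}")
        current_vendor, count = vendor, 0
    count += int(value)
if current_vendor is not None:
    print(f"{current_vendor} | {count}")
```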