How do I pass a parameter to a python Hadoop streaming job?

zzztimbo · Mar 1, 2012 · Viewed 8.7k times

For a Python Hadoop streaming job, how do I pass a parameter to, for example, the reducer script so that it behaves differently based on the parameter passed in?

I understand that streaming jobs are called in the format of:

hadoop jar hadoop-streaming.jar -input <input> -output <output> -mapper mapper.py -reducer reducer.py ...

I want to affect reducer.py.

Answer

Ray Toal · Mar 1, 2012

The argument to the command-line option -reducer can be any command, so you can try:

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -input inputDirs \
    -output outputDir \
    -mapper myMapper.py \
    -reducer 'myReducer.py 1 2 3' \
    -file myMapper.py \
    -file myReducer.py

assuming myReducer.py is made executable. Disclaimer: I have not tried it, but I have passed similar complex strings to -mapper and -reducer before.
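With that invocation, the strings 1 2 3 land in the reducer's sys.argv. A minimal sketch of what myReducer.py could look like (the scaling logic and argument meaning here are invented for illustration; the only part taken from the answer is that the extra arguments arrive via sys.argv):

```python
#!/usr/bin/env python
import sys


def scale(value, factor):
    """Hypothetical per-value logic controlled by the passed-in parameter."""
    return value * factor


def main(argv):
    # With -reducer 'myReducer.py 1 2 3', argv[1:] is ['1', '2', '3'].
    # Here we use only the first argument, defaulting to 1 if absent.
    factor = int(argv[1]) if len(argv) > 1 else 1
    for line in sys.stdin:
        # Streaming hands the reducer tab-separated key/value lines.
        key, _, val = line.rstrip("\n").partition("\t")
        print("%s\t%d" % (key, scale(int(val), factor)))


if __name__ == "__main__":
    main(sys.argv)
```

The shebang line plus executable permission (chmod +x myReducer.py) is what lets Hadoop run the quoted command directly.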

That said, have you tried the

-cmdenv name=value

option, and just have your Python reducer get its value from the environment? It's just another way to do things.
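For example, after launching the job with -cmdenv MODE=verbose, the reducer could read the setting like this (the variable name MODE and the default are assumptions for illustration):

```python
import os


def get_mode(environ=os.environ):
    # -cmdenv MODE=verbose on the hadoop command line makes MODE
    # visible in the streaming task's environment.
    return environ.get("MODE", "default")


if __name__ == "__main__":
    print("running in %s mode" % get_mode())
```

This avoids quoting issues on the -reducer option, at the cost of the parameter no longer being visible in the reducer command itself.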