I am a bit confused with the output I get from Mapper.
For example, when I run a simple wordcount program, with this input text:
hello world
Hadoop programming
mapreduce wordcount
lets see if this works
12345678
hello world
mapreduce wordcount
this is the output that I get:
12345678 1
Hadoop 1
hello 1
hello 1
if 1
lets 1
mapreduce 1
mapreduce 1
programming 1
see 1
this 1
wordcount 1
wordcount 1
works 1
world 1
world 1
As you can see, the output from mapper is already sorted. I did not run Reducer
at all.
But I find in a different project that the output from mapper is not sorted.
So I am totally clear about this..
My questions are:
sort and shuffle
phase and persist it before it goes to Reducer? A reducer is presented with a key and a list of iterables. Is there a way, I could persist this data?Is the mapper's output always sorted?
No. It is not sorted if you use no reducer. If you use a reducer, there is a pre-sorting process before the mapper's output is written to disk. Data gets sorted in the Reduce phase. What is happening here (just a guess) is that you are not specifying a Reducer class, which, in the new API, is translated into using the Identity Reducer (see this answer and comment). The Identity Reducer just outputs its input. To verify that, see the default Reducer counters (there should be some reduce tasks, reduce input records & groups, reduce output records...)
Is the sort phase integrated into the mapper phase already, so that the output of map phase is already sorted in the intermediate data?
As I explained in the previous question, if you use no reducers, mapper does not sort the data. If you do use reducers, the data start getting sorted from the map phase and then get merge-sorted in the reduce phase.
Is there a way to collect the data from sort and shuffle phase and persist it before it goes to Reducer. A reducer is presented with a key and a list of iterables. Is there a way, I could persist this data?
Again, shuffling and sorting are parts of the Reduce phase. An Identity Reducer will do what you want. If you want to output one key-value pair per reducer, with the values being a concatenation of the iterables, just store the iterables in memory (e.g. in a StringBuffer) and then output this concatenation as a value. If you want the map output to go straight to the program's output, without going through a reduce phase, then set in the driver class the number of reduce tasks to zero, like that:
job.setNumReduceTasks(0);
This will not get your output sorted, though. It will skip the pre-sorting process of the mapper and write the output directly to HDFS.