Storing results of UNION in PIG in a single file

Uno · Jun 8, 2012 · Viewed 8.5k times

I have a Pig script which produces four results, and I want to store all of them in a single file. I tried using UNION, but when I do, I get four files: part-m-00000, part-m-00001, part-m-00002, part-m-00003. Can't I get a single file?

Here is the Pig script:

A = UNION Message_1, Message_2, Message_3, Message_4;
STORE A INTO 'AA';

Inside the AA folder I get 4 files as mentioned above. Can't I get a single file with all entries in it?

Answer

Donald Miner · Jun 9, 2012

Pig is doing the right thing here: it is unioning the data sets. In Hadoop, one data set does not have to be one file; a data set is usually a folder of part files. A UNION doesn't need a reduce phase, so Pig doesn't run one, and each map task writes its own part-m file.

You need to trick Pig into running both a map and a reduce phase. The way I usually do this is:

SET default_parallel 1;

...
A = UNION Message_1,Message_2,Message_3,Message_4;
B = GROUP A BY 1; -- group ALL of the records together
C = FOREACH B GENERATE FLATTEN(A);
...

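The GROUP BY groups all of the records together, which forces a reduce phase, and with default_parallel set to 1 that phase runs on a single reducer. The FLATTEN then explodes that list back out, so the single reducer writes every record to one output file.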


One thing to note here is that this isn't much different from doing:

$ hadoop fs -cat msg1.txt msg2.txt msg3.txt msg4.txt | hadoop fs -put - union.txt

(This concatenates all of the text and then writes it back out to HDFS as a new file.)

This isn't parallel at all, but neither is funneling all of your data through one reducer.
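
The concatenation approach above can be tried out locally without a cluster. The following is only a rough sketch that uses ordinary files and plain cat as a stand-in for HDFS; the function name merge_parts, the sample record text, and the temporary directory layout are all made up for illustration, and the folder name AA simply mirrors the one in the question.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hadoop or Pig): concatenate every
# part-* file in a directory into a single output file.
merge_parts() {
    src_dir=$1
    out_file=$2
    # The shell expands the glob in sorted order, so part-m-00000 comes first.
    cat "$src_dir"/part-* > "$out_file"
}

# Build a local stand-in for the Pig output folder with four part files.
work_dir=$(mktemp -d)
mkdir "$work_dir/AA"
for i in 0 1 2 3; do
    printf 'message %s\n' "$i" > "$work_dir/AA/part-m-0000$i"
done

# Merge the four part files into one file and show the result.
merge_parts "$work_dir/AA" "$work_dir/union.txt"
cat "$work_dir/union.txt"
```

On a real cluster, hadoop fs -getmerge AA union.txt does a similar concatenation of the files in an HDFS folder, writing the result to a local file rather than back to HDFS.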