How do I store gzipped files using PigStorage in Apache Pig?

PP. · Feb 11, 2011

Apache Pig v0.7 can read gzipped files with no extra effort on my part, e.g.:

MyData = LOAD '/tmp/data.csv.gz' USING PigStorage(',') AS (timestamp, user, url);

I can process that data and output it to disk okay:

PerUser = GROUP MyData BY user;
UserCount = FOREACH PerUser GENERATE group AS user, COUNT(MyData) AS count;
STORE UserCount INTO '/tmp/usercount' USING PigStorage(',');

But the output file isn't compressed:

/tmp/usercount/part-r-00000

Is there a way of telling the STORE command to output content in gzip format? Note that ideally I'd like an answer applicable to Pig 0.6, as I wish to use Amazon Elastic MapReduce; but if there's a solution for any version of Pig I'd like to hear it.

Answer

ysr · Nov 27, 2012

There are two ways:

  1. As mentioned above, give the output directory a .gz extension, e.g. usercount.gz, and the gzip codec is picked from the suffix:

    STORE UserCount INTO '/tmp/usercount.gz' USING PigStorage(',');

  2. Set the compression properties in your script (see the full sketch after this list):

    SET output.compression.enabled true;
    SET output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
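For context, here is a minimal end-to-end sketch of the second approach applied to the script from the question. The relation names and paths are the ones used above; on older releases such as Pig 0.6 you may need to pass the equivalent Hadoop properties on the command line instead, which I haven't verified.

-- enable gzip compression for all STORE output in this script
SET output.compression.enabled true;
SET output.compression.codec org.apache.hadoop.io.compress.GzipCodec;

MyData = LOAD '/tmp/data.csv.gz' USING PigStorage(',') AS (timestamp, user, url);
PerUser = GROUP MyData BY user;
UserCount = FOREACH PerUser GENERATE group AS user, COUNT(MyData) AS count;

-- the part files under /tmp/usercount should now come out gzipped, e.g. part-r-00000.gz
STORE UserCount INTO '/tmp/usercount' USING PigStorage(',');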