Apache Pig v0.7 can read gzipped files with no extra effort on my part, e.g.:
MyData = LOAD '/tmp/data.csv.gz' USING PigStorage(',') AS (timestamp, user, url);
I can process that data and output it to disk okay:
PerUser = GROUP MyData BY user;
UserCount = FOREACH PerUser GENERATE group AS user, COUNT(MyData) AS count;
STORE UserCount INTO '/tmp/usercount' USING PigStorage(',');
But the output file isn't compressed:
/tmp/usercount/part-r-00000
Is there a way of telling the STORE command to output content in gzip format? Note that ideally I'd like an answer applicable for Pig 0.6 as I wish to use Amazon Elastic MapReduce; but if there's a solution for any version of Pig I'd like to hear it.
There are two ways:
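1. Give the output directory a .gz extension, e.g. usercount.gz. PigStorage picks the compression codec from that extension, so the part files written inside the directory are gzipped:
STORE UserCount INTO '/tmp/usercount.gz' USING PigStorage(',');
2. Set the compression codec in your script, before the STORE statement: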
set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
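Putting the second option together with the script from the question, a minimal sketch might look like the following. It assumes a Pig version that honours the two output.compression properties; whether Pig 0.6 on Elastic MapReduce does so is not confirmed here and is worth checking on a small job first.

```
-- Sketch only: enable gzip output for what this script stores.
-- Assumption: the running Pig version honours these two properties.
set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;

-- Same pipeline as in the question; only the two set lines above are new.
MyData = LOAD '/tmp/data.csv.gz' USING PigStorage(',') AS (timestamp, user, url);
PerUser = GROUP MyData BY user;
UserCount = FOREACH PerUser GENERATE group AS user, COUNT(MyData) AS count;
STORE UserCount INTO '/tmp/usercount' USING PigStorage(',');
```

If the setting takes effect, the part files under /tmp/usercount should carry a .gz suffix instead of the plain part-r-00000 shown in the question.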