Skipping the header while loading the text file using Piglatin

Pawan Kumar picture Pawan Kumar · Oct 1, 2013 · Viewed 20.7k times · Source

I have a text file and it's first row contains the header. Now I want to do some operation on the data, but while loading the file using PigStorage it takes the HEADER too. I just want to skip the HEADER. Is it possible to do so(directly or through a UDF)?

This is the command which i'm using to load the data:

input_file = load '/home/hadoop/smdb_tracedata.csv'
USING PigStorage(',')
as (trans:chararray, carrier:chararray,aainday:chararray);

Answer

Davis Broda picture Davis Broda · Oct 1, 2013

If you have pig version 0.11 you could try this:

input_file = load '/home/hadoop/smdb_tracedata.csv' USING PigStorage(',') as (trans:chararray, carrier :chararray,aainday:chararray);

ranked = rank input_file;

NoHeader = Filter ranked by (rank_input_file > 1);

Ordered = Order NoHeader by rank_input_file

New_input_file = foreach Ordered Generate trans, carrier, aainday;

This would get rid of the first row, leaving New_input_file exactly the same as the original, without the header row (assuming header row is the first row in the file). Please note that the rank operator is only available in pig 0.11, so if you have an earlier version you will need to find another way.

Edit: added the ordered line in order to make sure New_input_file maintains the same order as the original input file