Removing duplicates using PigLatin

aalsum · Jul 18, 2012 · Viewed 15.7k times

I'm using PigLatin to filter some records.

User1  8 NYC 
User1  9 NYC 
User1  7 LA 
User2  4 NYC
User2  3 DC 

The script should remove duplicate records per user and keep just one of them, something like the uniq command in Linux.

The output should be:

User1 8 NYC 
User2 4 NYC

Any suggestions?

Answer

alexeipab · Jul 19, 2012

For your particular example, DISTINCT will not work well, because your output contains all of the input columns ($0, $1, $2) and the rows differ in $1. You can apply DISTINCT only to a projection such as ($0, $2) or ($0), which means losing $1.
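To illustrate the projection case, here is a minimal sketch. The file name 'users.txt', the tab-delimited layout, and the field names user, score, and city are assumptions made for the example and do not come from the question.

```pig
-- Hypothetical input file and schema; PigStorage splits on tabs by default
inpt = LOAD 'users.txt' AS (user:chararray, score:int, city:chararray);

-- Project only the columns that should be unique, dropping score ($1)
user_city = FOREACH inpt GENERATE user, city;

-- Remove duplicate (user, city) tuples
uniq_user_city = DISTINCT user_city;

DUMP uniq_user_city;
```

With the sample data from the question this would leave four tuples, (User1,NYC), (User1,LA), (User2,NYC), and (User2,DC), in no guaranteed order, so it still returns more than one row per user.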

To select one record per user (any record), you could use GROUP BY and a nested FOREACH with LIMIT. For example:

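-- Load the input (path and loader options omitted)
inpt = load '......' ......;
-- Group all records by the first column (the user)
user_grp = GROUP inpt BY $0;
-- For each user, keep a single arbitrary record from the group
filtered = FOREACH user_grp {
      top_rec = LIMIT inpt 1;
      GENERATE FLATTEN(top_rec);
};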

This approach gives you records that are unique on a subset of fields, and it also lets you control the number of output records per user through the LIMIT value.
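If it matters which record is kept, one possible variation is to sort inside the nested block before applying LIMIT. The sketch below is an assumption-laden example rather than part of the original answer: the file name 'users.txt', the tab-delimited layout, the field names, and the choice to keep the highest score are all hypothetical.

```pig
-- Hypothetical input file and schema; PigStorage splits on tabs by default
inpt = LOAD 'users.txt' AS (user:chararray, score:int, city:chararray);

-- One group (bag of records) per user
user_grp = GROUP inpt BY user;

top_per_user = FOREACH user_grp {
      -- Sort each user's records so the kept one is well defined
      sorted = ORDER inpt BY score DESC;
      -- Raise this number to keep more records per user
      top_rec = LIMIT sorted 1;
      GENERATE FLATTEN(top_rec);
};

DUMP top_per_user;
```

On the sample data this would keep (User1,9,NYC) and (User2,4,NYC). Note that this differs from the output shown in the question, which keeps the first record seen for User1 rather than the one with the highest score.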