I'm using PigLatin to filter some records.
User1 8 NYC
User1 9 NYC
User1 7 LA
User2 4 NYC
User2 3 DC
The script should remove the duplicate records for each user and keep just one of them, something like the uniq command in Linux.
The output should be:
User1 8 NYC
User2 4 NYC
Any suggestions?
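For your particular example, DISTINCT will not work well, because your output contains all of the input columns ($0, $1, $2). DISTINCT compares whole records, and the records for a user differ in $1, so none of them would be dropped. You can apply DISTINCT only to a projection that has columns ($0, $2) or ($0), and that loses $1.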
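To illustrate the projection point, here is a minimal sketch. The path 'input.txt', the space delimiter, and the field names user_id, score, and city are assumptions made for the example; they are not from the question.

```pig
-- Hypothetical path, delimiter, and schema; adjust to the real input.
inpt = LOAD 'input.txt' USING PigStorage(' ')
       AS (user_id:chararray, score:int, city:chararray);

-- Project only the columns to deduplicate on, then apply DISTINCT.
user_city      = FOREACH inpt GENERATE user_id, city;
uniq_user_city = DISTINCT user_city;   -- one row per (user_id, city); score is gone

users_only = FOREACH inpt GENERATE user_id;
uniq_users = DISTINCT users_only;      -- one row per user_id; score and city are gone

DUMP uniq_user_city;
DUMP uniq_users;
```

Either relation is free of duplicates, but neither carries the second column any more, which is why DISTINCT alone cannot produce the output you want.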
To select one record per user (any record), you could use a GROUP BY and a nested FOREACH with LIMIT. For example:
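    inpt = LOAD '......' ......;
    -- Group all records that share the same user (first column).
    user_grp = GROUP inpt BY $0;
    filtered = FOREACH user_grp {
        -- Keep one (arbitrary) record from each user's bag.
        top_rec = LIMIT inpt 1;
        GENERATE FLATTEN(top_rec);
    };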
This approach gives you records that are unique on a subset of the fields (here, the user), and the LIMIT value lets you control how many records are kept for each user.
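If it matters which record is kept, one possible variation is to sort each user's bag before applying LIMIT. This is a sketch under assumptions: the path, delimiter, and field names are hypothetical, and choosing the highest second column is only an example criterion that the question does not ask for.

```pig
-- Hypothetical path, delimiter, and schema; adjust to the real input.
inpt = LOAD 'input.txt' USING PigStorage(' ')
       AS (user_id:chararray, score:int, city:chararray);

user_grp = GROUP inpt BY user_id;

top_per_user = FOREACH user_grp {
    -- Sort this user's records so the choice no longer depends on input order.
    sorted  = ORDER inpt BY score DESC;
    -- Raise the 1 to keep more records per user.
    top_rec = LIMIT sorted 1;
    GENERATE FLATTEN(top_rec);
};

DUMP top_per_user;
```

Without the ORDER step, which record survives the LIMIT is not guaranteed, so the plain version suits the case where any one record per user will do.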