In Apache Pig, select DISTINCT rows based on a single column

Arel picture Arel · May 28, 2014 · Viewed 8k times · Source

Let's say I have a table such as the one below, that may or may not contain duplicates for a given field:

ID     URL
---    ------------------
001    http://example.com/adam
002    http://example.com/beth
002    http://example.com/beth?extra=blah
003    http://example.com/charlie

I would like to write a Pig script to find only DISTINCT rows, based on the value of a single field. For instance, filtering the table above by ID should return something like the following:

ID     URL
---    ------------------
001    http://example.com/adam
002    http://example.com/beth
003    http://example.com/charlie

The Pig GROUP BY operator returns a bag of tuples grouped by ID, which would work if I knew how to get just the first tuple per bag (perhaps a separate question).

The Pig DISTINCT operator works on the entire row, so in this case all four rows would be considered unique, which is not what I want.

For my purposes, I do not care which of the rows with ID 002 are returned.

Answer

Arel picture Arel · May 28, 2014

I found one way to do this, using the GROUP BY and the TOP operators:

my_table = LOAD 'my_table_file' AS (A, B);

my_table_grouped = GROUP my_table BY A;

my_table_distinct = FOREACH my_table_grouped {

    -- For each group $0 refers to the group name, (A)
    -- and $1 refers to a bag of entire rows {(A, B), (A, B), ...}.
    -- Here, we take only the first (top 1) row in the bag:

    result = TOP(1, 0, $1);
    GENERATE FLATTEN(result);

}

DUMP my_table_distinct;

This results in one distinct row per ID column:

(001,http://example.com/adam)
(002,http://example.com/beth?extra=blah)
(003,http://example.com/charlie)

I don't know if there is a better approach, but this works for me. I hope this helps others starting out with Pig.

(Reference: http://pig.apache.org/docs/r0.12.1/func.html#topx)