How to perform a DISTINCT in Pig Latin on a subset of columns?

Freerobots · Sep 26, 2013

I would like to perform a DISTINCT operation on a subset of the columns. The documentation says this is possible with a nested foreach:

You cannot use DISTINCT on a subset of fields; to do this, use FOREACH and a nested block to first select the fields and then apply DISTINCT (see Example: Nested Block).

It is simple to perform a DISTINCT operation on all of the columns:

A = LOAD 'data' AS (a1,a2,a3,a4);
A_unique = DISTINCT A;

Let's say that I am interested in performing the distinct across a1, a2, and a3. Can anyone provide an example showing how to perform this operation with a nested foreach, as suggested in the documentation?

Here's an example of input and expected output:

A = LOAD 'data' AS (a1,a2,a3,a4);
DUMP A;

(1 2 3 4)
(1 2 3 4)
(1 2 3 5)
(1 2 4 4)

-- insert DISTINCT operation on a1,a2,a3 here:
-- ...

DUMP A_unique;

(1 2 3 4)
(1 2 4 4)

Answer

reo katoa · Sep 26, 2013

Group on all the other columns, project just the columns of interest into a bag, and then use FLATTEN to expand them out again:

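A_unique =
    FOREACH (GROUP A BY a4) {
        b = A.(a1,a2,a3);   -- project the columns of interest into a bag
        s = DISTINCT b;     -- de-duplicate (a1,a2,a3) within this a4 group
        GENERATE FLATTEN(s), group AS a4;   -- expand the bag and re-attach a4
    };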
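
One thing to watch: because the snippet above groups by a4, the DISTINCT is applied separately inside each a4 group. On the sample input from the question, (1,2,3) with a4 = 5 sits in its own group, so the result still has three rows, (1,2,3,4), (1,2,4,4) and (1,2,3,5), which is the same as DISTINCT over all four columns. If the goal is exactly one row per distinct (a1,a2,a3), as in the expected output, a possible variation is to group on those three columns instead and keep a single row from each group. The following is a minimal sketch, not part of the original answer; the aliases ordered and kept are made up for illustration, and the rule for choosing a4 (smallest value wins) is an assumption, since the question does not say which a4 should survive.

```pig
-- Sketch: keep one row per distinct (a1, a2, a3).
A = LOAD 'data' AS (a1,a2,a3,a4);

A_unique =
    FOREACH (GROUP A BY (a1,a2,a3)) {
        ordered = ORDER A BY a4;     -- assumption: prefer the smallest a4
        kept    = LIMIT ordered 1;   -- one row per (a1,a2,a3) group
        GENERATE FLATTEN(kept);
    };

DUMP A_unique;
```

On the sample input this should leave (1,2,3,4) and (1,2,4,4), though the order in which the groups are printed is not guaranteed. If any a4 value is acceptable, the ORDER line can be dropped and LIMIT applied to A directly.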