For a file of the form
A B user1
C D user2
A D user3
A D user1
I want to count the distinct values of field 3, i.e. count(distinct(user1, user2, user3, user1)) = 3.
I am doing this using the following Pig script:
A = load 'myTestData' using PigStorage('\t') as (a1,a2,a3);
user_list = foreach A GENERATE $2;
unique_users = DISTINCT user_list;
unique_users_group = GROUP unique_users ALL;
uu_count = FOREACH unique_users_group GENERATE COUNT(unique_users);
store uu_count into 'output';
Is there a better way to get count of distinct values of a field?
A more up-to-date way to do this:
user_data = LOAD 'myTestData' USING PigStorage('\t') AS (a1,a2,a3);
users = FOREACH user_data GENERATE a3;
uniq_users = DISTINCT users;
grouped_users = GROUP uniq_users ALL;
uniq_user_count = FOREACH grouped_users GENERATE COUNT(uniq_users);
DUMP uniq_user_count;
This prints the value (3) to the console, in among Pig's log output.
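For comparison, the same count can also be written with a nested DISTINCT inside a FOREACH block, which skips the separate projection step. This is only a sketch, assuming the same tab-separated 'myTestData' file and the a1, a2, a3 field names used above; the aliases all_rows and uniq are made up for illustration.

```pig
-- Load the same tab-separated file; field names follow the script above.
user_data = LOAD 'myTestData' USING PigStorage('\t') AS (a1, a2, a3);

-- Put every row into a single group so the count covers the whole file.
all_rows = GROUP user_data ALL;

-- Inside the nested block, de-duplicate the third field and count what is left.
uniq_user_count = FOREACH all_rows {
    uniq = DISTINCT user_data.a3;  -- 'uniq' is a hypothetical alias
    GENERATE COUNT(uniq);
};

DUMP uniq_user_count;
```

Both versions gather everything into one group before counting. Which one performs better on a large input is not something this sketch establishes, so it is worth timing both on your own data.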