I have a feed in the following format:
Hour Key ID Value
1 K1 001 3
1 K1 002 2
2 K1 005 4
1 K2 002 1
2 K2 003 5
2 K2 004 6
and I want to group the feed by (Hour, Key)
then sum the Value
but keep ID
as a tuple:
({1, K1}, {001, 002}, 5)
({2, K1}, {005}, 4)
({1, K2}, {002}, 1)
({2, K2}, {003, 004}, 11)
I know how to use FLATTEN
to generate the sum of the Value
but don't know how to output ID
as a tuple. This is what I have so far:
A = LOAD 'data' AS (Hour:chararray, Key:chararray, ID:chararray, Value:int);
B = GROUP A BY (Hour, Key);
C = FOREACH B GENERATE
FLATTEN(group) AS (Hour, Key),
SUM(A.Value) AS Value
;
Will you explain how to do this? Appreciate it!
You just need to use the bag projection operator, .
. This will create a new bag where the tuples have just the element(s) you specify. In your case, use A.ID
. In fact, you are already using this operator to provide the input to SUM
-- the input to sum is a bag of single-element tuples, which you create by projecting the Value
field.
A = LOAD 'data' AS (Hour:chararray, Key:chararray, ID:chararray, Value:int);
B = GROUP A BY (Hour, Key);
C = FOREACH B GENERATE
FLATTEN(group) AS (Hour, Key),
A.ID,
SUM(A.Value) AS Value
;