Hive: Sum over a specified group (HiveQL)

joshlk picture joshlk · Aug 1, 2014 · Viewed 51.8k times · Source

I have a table:

key    product_code    cost
1      UK              20
1      US              10
1      EU              5
2      UK              3
2      EU              6

I would like to find the sum of all products for each group of "key" and append to each row. For example for key = 1, find the sum of costs of all products (20+10+5=35) and then append result to all rows which correspond to the key = 1. So end result:

key    product_code    cost     total_costs
1      UK              20       35
1      US              10       35
1      EU              5        35
2      UK              3        9
2      EU              6        9

I would prefer to do this without using a sub-join as this would be inefficient. My best idea would be to use the over function in conjunction with the sum function but I cant get it to work. My best try:

SELECT key, product_code, sum(costs) over(PARTITION BY key)
FROM test
GROUP BY key, product_code;

Iv had a look at the docs but there so cryptic I have no idea how to work out how to do it. Im using Hive v0.12.0, HDP v2.0.6, HortonWorks Hadoop distribution.

Answer

joshlk picture joshlk · Aug 12, 2014

Similar to @VB_ answer, use the BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING statement.

The HiveQL query is therefore:

SELECT key, product_code,
SUM(costs) OVER (PARTITION BY key ORDER BY key ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM test;