Summing values of Hive array types

Alex N. · Sep 12, 2012

Hive has a pretty nice Array type that is very useful in theory, but in practice I found very little information on how to perform any kind of operation on it. We store a series of numbers in an array-type column and need to SUM them in a query, preferably from the n-th to the m-th element. Is this possible with standard HiveQL, or does it require a UDF or a custom mapper/reducer?

Note: we're using Hive 0.8.1 in an EMR environment.

Answer

Lorand Bendig · Sep 17, 2012

I'd write a simple UDF for this purpose. You need to have hive-exec on your build path, e.g. with Maven:

<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>0.8.1</version>
</dependency>

A simple raw implementation would look like this:

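package com.myexample;

import java.util.ArrayList;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.IntWritable;

/** Sums the elements of an int array in the index range [from, to). */
public class SubArraySum extends UDF {

    public IntWritable evaluate(ArrayList<Integer> list,
      IntWritable from, IntWritable to) {
        // Return NULL for missing input rather than a sentinel such as -1,
        // which could be mistaken for a real sum.
        if (list == null || from == null || to == null) {
            return null;
        }

        // m: inclusive, n: exclusive; both are clamped to the array bounds
        // so that a short array cannot throw IndexOutOfBoundsException
        int m = Math.max(0, from.get());
        int n = Math.min(list.size(), to.get());

        int sum = 0;
        for (int i = m; i < n; i++) {
            Integer value = list.get(i);
            if (value != null) { // skip NULL elements
                sum += value;
            }
        }
        return new IntWritable(sum);
    }
}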
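
The summing logic can also be tried out without a Hive installation. Below is a minimal sketch in plain Java; the class name RangeSum and the method sumRange are made up for illustration and are not part of Hive. It assumes that out-of-range bounds should be clamped to the array and that null elements should be skipped, which may or may not be the behavior you want.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hive: the range-sum logic on its own.
final class RangeSum {

    // Sums the elements at indices [from, to): from inclusive, to exclusive.
    // Assumption: bounds are clamped to the list and null elements are skipped.
    static int sumRange(List<Integer> list, int from, int to) {
        if (list == null) {
            return 0;
        }
        int start = Math.max(0, from);
        int end = Math.min(list.size(), to);
        int sum = 0;
        for (int i = start; i < end; i++) {
            Integer value = list.get(i);
            if (value != null) {
                sum += value;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumRange(Arrays.asList(0, 1, 2, 3, 4), 1, 3)); // 3
        System.out.println(sumRange(Arrays.asList(5, 6, 7, 8, 9), 1, 3)); // 13
    }
}
```

Keeping the arithmetic in a plain static method like this makes it easy to unit-test before wrapping it in a UDF class.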

Next, build a jar and load it in the Hive shell:

hive> add jar /home/user/jar/myjar.jar;
hive> create temporary function subarraysum as 'com.myexample.SubArraySum';

Now you can use it to calculate the sum of the array you have.

For example, assume that you have an input file with two tab-separated columns (the gap between the columns below is a single tab character):

1   0,1,2,3,4
2   5,6,7,8,9

Load it into mytable:

hive> create external table mytable (
  id int,
  nums array<int>
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hadoopuser/hive/input';
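
To make the two delimiters concrete, here is a rough sketch in plain Java of how one line of the input file breaks down: the tab separates the fields, and the comma separates the items of the array. The RowParser class and its methods are hypothetical and only mimic the effect of the table definition; this is not how Hive implements it internally.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; Hive does this work internally.
final class RowParser {

    // First field (before the tab) is the id column.
    static int parseId(String line) {
        return Integer.parseInt(line.split("\t", -1)[0].trim());
    }

    // Second field (after the tab) is the array; its items are comma-separated.
    static List<Integer> parseNums(String line) {
        String[] fields = line.split("\t", -1);
        List<Integer> nums = new ArrayList<>();
        if (fields.length < 2 || fields[1].isEmpty()) {
            return nums; // no array column on this line
        }
        for (String item : fields[1].split(",")) {
            nums.add(Integer.parseInt(item.trim()));
        }
        return nums;
    }

    public static void main(String[] args) {
        String line = "1\t0,1,2,3,4";
        System.out.println(parseId(line) + " " + parseNums(line)); // 1 [0, 1, 2, 3, 4]
    }
}
```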

Then run some queries:

hive> select * from mytable;
1   [0,1,2,3,4]
2   [5,6,7,8,9]

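Sum the elements in the index range [m, n), where m=1 (inclusive) and n=3 (exclusive), i.e. the elements at zero-based indices 1 and 2: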

hive> select subarraysum(nums, 1,3) from mytable;
3
13

Or aggregate the per-row sums over the whole table:

hive> select sum(subarraysum(nums, 1,3)) from mytable;
16