Hive getting top n records in group by query

TopCoder · Feb 22, 2012 · Viewed 45k times

I have the following table in Hive:

user-id, user-name, user-address, clicks, impressions, page-id, page-name

I need to find the top 5 users [user-id, user-name, user-address] by clicks for each page [page-id, page-name].

I understand that we first need to group by [page-id, page-name], then order the rows within each group by [clicks, impressions] descending, and finally emit only the top 5 users [user-id, user-name, user-address] for each page, but I am finding it difficult to construct the query.

How can we do this using a Hive UDF?

Answer

Hai-Anh Trinh · Apr 4, 2013

Revised answer, fixing the bug mentioned by @Himanshu Gahlot:

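-- Column names follow the question; hyphenated names such as page-id must be
-- backtick-quoted or renamed (e.g. page_id) in a real table.
SELECT page-id, user-id, clicks
FROM (
    -- rank() restarts its counter whenever the page-id value changes
    SELECT page-id, user-id, rank(page-id) AS rank, clicks
    FROM (
        -- send all rows of a page to one reducer, most-clicked first
        SELECT page-id, user-id, clicks
        FROM mytable
        DISTRIBUTE BY page-id
        SORT BY page-id, clicks DESC
    ) a
) b
WHERE rank < 5
ORDER BY page-id, rank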

Note that the rank() UDF is applied to the page-id column: whenever that column's value changes the rank counter is reset, and otherwise it is incremented, so the count restarts for each page-id partition.
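
The answer does not show the code behind rank(), so the following is only a minimal sketch of the counter-reset logic described above. The class name RankCounter and the choice to start counting at 0 are assumptions, not the answerer's actual implementation. A real Hive UDF would typically extend org.apache.hadoop.hive.ql.exec.UDF; this sketch leaves that out so it compiles without Hive on the classpath.

```java
// Hypothetical stand-in for the rank() function used in the query above.
// Assumptions: keys are non-null strings and the counter starts at 0.
final class RankCounter {
    private String lastKey;   // key seen on the previous row; null before the first row
    private int counter;      // position of the current row within its key group

    // Returns 0 for the first row of each key, then 1, 2, ... while the key repeats.
    public int evaluate(String key) {
        if (!key.equals(lastKey)) {
            // A new page-id has started: reset the counter.
            counter = 0;
            lastKey = key;
        }
        return counter++;
    }

    public static void main(String[] args) {
        // page-id values in the order a reducer would see them after
        // DISTRIBUTE BY page-id / SORT BY page-id, clicks DESC.
        String[] pageIds = {"p1", "p1", "p1", "p2", "p2"};
        RankCounter rank = new RankCounter();
        StringBuilder out = new StringBuilder();
        for (String pageId : pageIds) {
            out.append(rank.evaluate(pageId)).append(' ');
        }
        System.out.println(out.toString().trim()); // prints: 0 1 2 0 1
    }
}
```

The counter is only meaningful because DISTRIBUTE BY sends every row of a page to the same reducer and SORT BY orders them there, so each page's rows arrive together with the most-clicked first. If the counter starts at 0 as assumed here, WHERE rank < 5 keeps five rows per page; a 1-based counter would need rank <= 5 instead.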