I have the following table in Hive:
user-id, user-name, user-address,clicks,impressions,page-id,page-name
I need to find the top 5 users [user-id, user-name, user-address] by clicks for each page [page-id, page-name].
I understand that we first need to group by [page-id, page-name]. Within each group I want to order by [clicks, impressions] descending and then emit only the top 5 users [user-id, user-name, user-address] for each page, but I am finding it difficult to construct the query.
How can we do this using a Hive UDF?
Revised answer, fixing the bug as mentioned by @Himanshu Gahlot
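SELECT page-id, user-id, clicks
FROM (
    -- rank() restarts its counter whenever page-id changes
    SELECT page-id, user-id, rank(page-id) AS rank, clicks
    FROM (
        -- send each page's rows to one reducer, sorted by clicks, highest first
        SELECT page-id, user-id, clicks
        FROM mytable
        DISTRIBUTE BY page-id
        SORT BY page-id, clicks DESC
    ) a
) b
-- adjust this bound to match the value at which your rank() starts counting
WHERE rank < 5
ORDER BY page-id, rank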
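Note that rank() here is a user-defined function that returns a value for every row, not an aggregate. It is applied to the page-id column: each time it sees a new page-id value it resets its counter, and otherwise it increments it, so the counter restarts for each page-id partition. This only works because the inner query has already distributed the rows by page-id and sorted them by page-id and clicks.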
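To make the reset-or-increment behaviour concrete, here is a minimal sketch of the counter logic in Python. It is not the actual Hive UDF (which would be written in Java against the Hive API); the names make_rank and top_n_per_page are hypothetical, and the counter is assumed to start at 0, which is why a bound of `rank < n` keeps n rows per page.

```python
from typing import Callable, Hashable, List, Tuple

Row = Tuple[str, str, int]  # (page_id, user_id, clicks)


def make_rank() -> Callable[[Hashable], int]:
    """Return a stateful rank(key): reset to 0 on a new key, else increment."""
    last_key: object = object()  # sentinel that never equals a real key
    counter = -1

    def rank(key: Hashable) -> int:
        nonlocal last_key, counter
        if key != last_key:
            # New partition: remember the key and restart the counter.
            last_key = key
            counter = 0
        else:
            # Same partition as the previous row: move to the next rank.
            counter += 1
        return counter

    return rank


def top_n_per_page(rows: List[Row], n: int) -> List[Row]:
    """Keep the n highest-click rows per page (assumes a 0-based rank)."""
    # Stand-in for DISTRIBUTE BY page-id / SORT BY page-id, clicks DESC.
    ordered = sorted(rows, key=lambda r: (r[0], -r[2]))
    rank = make_rank()
    # rank() must see the rows in sorted order, one call per row.
    ranked = [(page, user, clicks, rank(page)) for page, user, clicks in ordered]
    return [(page, user, clicks) for page, user, clicks, rk in ranked if rk < n]
```

If the rows were not sorted by page first, the same page could appear in several separate runs and the counter would restart in the middle of a page, which is the reason for the DISTRIBUTE BY and SORT BY in the inner query.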