How can I get the top-n (let's say top 10 or top 3) per group in Spark SQL?
http://www.xaprb.com/blog/2006/12/07/how-to-select-the-firstleastmax-row-per-group-in-sql/ provides a tutorial for general SQL. However, Spark does not implement subqueries in the WHERE clause.
You can use the window functions that were added in Spark 1.4. Suppose we have a productRevenue table with product, category, and revenue columns.
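If you want to try this end to end, here is a minimal Scala sketch that builds a small productRevenue table and registers it as a temp view so the SQL below can run against it. The sample rows, the app name, and the use of the SparkSession API are my own illustration (on Spark 1.x you would use sqlContext and registerTempTable instead); only the column names come from the query:

import org.apache.spark.sql.SparkSession

// Illustrative local session; in spark-shell a SparkSession named `spark` already exists.
val spark = SparkSession.builder()
  .appName("top-n-per-group")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Made-up sample rows -- only the column names (product, category, revenue)
// are taken from the query below; your real data will differ.
val productRevenue = Seq(
  ("Thin",       "Cell phone", 6000),
  ("Normal",     "Tablet",     1500),
  ("Mini",       "Tablet",     5500),
  ("Ultra thin", "Cell phone", 5000),
  ("Very thin",  "Cell phone", 6000),
  ("Big",        "Tablet",     2500)
).toDF("product", "category", "revenue")

// Register the DataFrame so spark.sql(...) can query it as "productRevenue".
productRevenue.createOrReplaceTempView("productRevenue")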
The answer to "What are the best-selling and the second best-selling products in every category?" is as follows:
SELECT product, category, revenue
FROM (
  SELECT product, category, revenue,
         dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) AS rank
  FROM productRevenue
) tmp
WHERE rank <= 2
This will give you the desired result. To get the top 10 (or top 3) per category instead, change the condition to rank <= 10 (or rank <= 3).
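The same result can also be expressed with the DataFrame API instead of a SQL string, which avoids the nested subquery. A sketch continuing from the productRevenue DataFrame above (the byCategory and topN names are just illustrative):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank}

// Window spec: one partition per category, highest revenue first.
val byCategory = Window.partitionBy("category").orderBy(col("revenue").desc)

// Rank products within each category and keep the top 2.
// dense_rank lets products tied on revenue share a rank;
// change the threshold to 10 or 3 for a different top-n.
val topN = productRevenue
  .withColumn("rank", dense_rank().over(byCategory))
  .where(col("rank") <= 2)
  .select("product", "category", "revenue")

topN.show()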