percentile functions with GROUPBY in BigQuery

donhcd picture donhcd · Aug 9, 2017 · Viewed 11k times · Source

In my CENSUS table, I'd like to group by State, and for each State get the median county population and the number of counties.

In psql, redshift, and snowflake, I can do this:

psql=> SELECT state, count(county), PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "population2000") AS median FROM CENSUS GROUP BY state;
        state         | count |  median
----------------------+-------+----------
 Alabama              |    67 |    36583
 Alaska               |    24 |   7296.5
 Arizona              |    15 |   116320
 Arkansas             |    75 |    20229
...

I'm trying to find a nice way to do this in standard BigQuery. I've noticed that there's undocumented percentile_cont analytical function available, but I have to do some major hacks to get it to do what I want.

I'd like to be able to do the same sort thing with what I've gathered are the correct arguments:

SELECT
  state,
  COUNT(county),
  PERCENTILE_CONT(population2000,
    0.5) OVER () AS `medPop`
FROM
  CENSUS
GROUP BY
  state;

but this query yields the error

SELECT list expression references column population2000 which is neither grouped nor aggregated at

I can get the answer I want, but I'd be very disappointed if this is the recommended way to do what I want to do:

SELECT
  MAX(nCounties) AS nCounties,
  state,
  MAX(medPop) AS medPop
FROM (
  SELECT
    nCounties,
    T1.state,
    (PERCENTILE_CONT(population2000,
        0.5) OVER (PARTITION BY T1.state)) AS `medPop`
  FROM
    census T1
  LEFT OUTER JOIN (
    SELECT
      COUNT(county) AS `nCounties`,
      state
    FROM
      census
    GROUP BY
      state) T2
  ON
    T1.state = T2.state) T3
GROUP BY
  state

Is there a better way to do what I want to do? Also, is the PERCENTILE_CONT function ever going to be documented?

Thanks for reading!

Answer

Mingyu Zhong picture Mingyu Zhong · Aug 9, 2017

Thanks for your interest. PERCENTILE_CONT is under development, and we will publish the documentation once it is GA. We will support it as analytic function first, and we plan to support it as aggregate function (allowing GROUP BY) later. Between these 2 releases, a simpler workaround would be

SELECT
  state,
  ANY_VALUE(nCounties) AS nCounties,
  ANY_VALUE(medPop) AS medPop
FROM (
  SELECT
    state,
    COUNT(county) OVER (PARTITION BY state) AS nCounties,
    PERCENTILE_CONT(population2000,
      0.5) OVER (PARTITION BY state) AS medPop
  FROM
    CENSUS)
GROUP BY
  state