How to calculate median of a numeric sequence in Google BigQuery efficiently?

Manish Agrawal picture Manish Agrawal · Mar 17, 2015 · Viewed 27.2k times · Source

I need to calculate median value of a numeric sequence in Google BigQuery efficiently. Is the same possible?

Answer

Pentium10 picture Pentium10 · Mar 17, 2015

Yeah it's possible with PERCENTILE_CONT window function.

Returns values that are based upon linear interpolation between the values of the group, after ordering them per the ORDER BY clause.

must be between 0 and 1.

This window function requires ORDER BY in the OVER clause.

So an example query would be like (the max() is there just to work across the group by but it's not being used as a math logic, should not confuse you)

SELECT room,
      max(median) FROM   (SELECT room,
         percentile_cont(0.5) OVER (PARTITION BY room
                                    ORDER BY temperature) AS median    FROM
    (SELECT 1 AS room,
            11 AS temperature),
    (SELECT 1 AS room,
            12 AS temperature),
    (SELECT 1 AS room,
            14 AS temperature),
    (SELECT 1 AS room,
            19 AS temperature),
    (SELECT 1 AS room,
            13 AS temperature),
    (SELECT 2 AS room,
            20 AS temperature),
    (SELECT 2 AS room,
            21 AS temperature),
    (SELECT 2 AS room,
            29 AS temperature),
    (SELECT 3 AS room,
            30 AS temperature)) GROUP BY room

This returns:

+------+-------------+
| room | temperature |
+------+-------------+
|    1 |          13 |
|    2 |          21 |
|    3 |          30 |
+------+-------------+