In analytics processing there is often a need to collapse "unimportant" groups of data into a single row of the result. One way to do this is to GROUP BY a CASE expression that coalesces the unimportant groups by returning a single value for them, e.g., NULL. This question is about efficient ways to perform this grouping in Amazon Redshift, which is based on ParAccel and is close to PostgreSQL 8.0 in terms of functionality.
As an example, consider a GROUP BY on type and url in a table where each row is a single URL visit. The goal is to perform aggregation such that one row is emitted for every (type, url) pair whose visit count reaches a certain threshold, and one (type, NULL) row is emitted per type aggregating all (type, url) pairs below that threshold. The rest of the columns in the result table would carry SUM/COUNT aggregates based on this grouping.
For example, the following data:
+------+----------------------+-----------------------+
| type | url                  | < 50+ other columns > |
+------+----------------------+-----------------------+
| A    | http://popular.com   |                       |
| A    | http://popular.com   |                       |
| A    | < 9997 more times>   |                       |
| A    | http://popular.com   |                       |
| A    | http://small-one.com |                       |
| B    | http://tiny.com      |                       |
| B    | http://tiny-too.com  |                       |
+------+----------------------+-----------------------+
should produce the following result table with a threshold of 10,000:
+------+--------------------+-------------+--------------------------+
| type | url                | visit_count | < SUM/COUNT aggregates > |
+------+--------------------+-------------+--------------------------+
| A    | http://popular.com |       10000 |                          |
| A    |                    |           1 |                          |
| B    |                    |           2 |                          |
+------+--------------------+-------------+--------------------------+
Summary:
Amazon Redshift has certain subquery correlation limitations that one needs to work around. Gordon Linoff's answer below (the accepted answer) shows how to GROUP BY a CASE expression using double aggregation, replicating the expression in both the result column and the outer GROUP BY clause:
WITH temp_counts AS (
    SELECT type, url, COUNT(*) AS cnt
    FROM t
    GROUP BY type, url
)
SELECT type, (CASE WHEN cnt >= 10000 THEN url END) AS url, SUM(cnt) AS cnt
FROM temp_counts
GROUP BY type, (CASE WHEN cnt >= 10000 THEN url END)
Further testing indicated that the double aggregation can be "unrolled" into a UNION ALL of independent queries, one per CASE branch. In this particular case, on a sample data set of approximately 200M rows, this approach consistently performed about 30% faster. That result is schema- and data-specific, however:
WITH temp_counts AS (
    SELECT type, url, COUNT(*) AS cnt
    FROM t
    GROUP BY type, url
)
SELECT type, url, cnt
FROM temp_counts
WHERE cnt >= 10000
UNION ALL
SELECT type, NULL AS url, SUM(cnt) AS cnt
FROM temp_counts
WHERE cnt < 10000
GROUP BY type
This suggests two general patterns for implementing and optimizing this kind of disjoint grouping and summarization in Amazon Redshift. If performance is important to you, benchmark both.
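Both patterns can be sanity-checked for agreement on toy data outside Redshift. Below is a minimal sketch using Python's sqlite3 module (the table name t, the lowered threshold of 3, and the toy rows are illustrative assumptions; sqlite's CTE support stands in for Redshift here, so this checks only correctness, not performance):

```python
import sqlite3

# Toy data set mirroring the question: (type, url) visit rows.
# The threshold is lowered to 3 so the popular/unpopular split
# is visible on a handful of rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (type TEXT, url TEXT)")
rows = ([("A", "http://popular.com")] * 5
        + [("A", "http://small-one.com")]
        + [("B", "http://tiny.com"), ("B", "http://tiny-too.com")])
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

# Pattern 1: double aggregation over a CASE expression.
double_agg = conn.execute("""
    WITH temp_counts AS (
        SELECT type, url, COUNT(*) AS cnt FROM t GROUP BY type, url
    )
    SELECT type, (CASE WHEN cnt >= 3 THEN url END) AS url, SUM(cnt) AS cnt
    FROM temp_counts
    GROUP BY type, (CASE WHEN cnt >= 3 THEN url END)
    ORDER BY type, cnt DESC
""").fetchall()

# Pattern 2: the same grouping unrolled into a UNION ALL.
union_all = conn.execute("""
    WITH temp_counts AS (
        SELECT type, url, COUNT(*) AS cnt FROM t GROUP BY type, url
    )
    SELECT type, url, cnt FROM temp_counts WHERE cnt >= 3
    UNION ALL
    SELECT type, NULL AS url, SUM(cnt) AS cnt
    FROM temp_counts WHERE cnt < 3 GROUP BY type
    ORDER BY type, cnt DESC
""").fetchall()

print(double_agg)  # [('A', 'http://popular.com', 5), ('A', None, 1), ('B', None, 2)]
assert double_agg == union_all
```

The ORDER BY is only there to make the two result sets directly comparable; in Redshift the two queries would be benchmarked without it.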
You would do this with two aggregations:
select type, (case when cnt > XXX then url end) as url, sum(cnt) as visit_cnt
from (select type, url, count(*) as cnt
from t
group by type, url
) t
group by type, (case when cnt > XXX then url end)
order by type, sum(cnt) desc;