What does it mean to have multiple sortkey columns?

Lorrin picture Lorrin · Jun 14, 2013 · Viewed 12.8k times · Source

Redshift allows designating multiple columns as SORTKEY columns, but most of the best-practices documentation is written as if there were only a single SORTKEY.

If I create a table with SORTKEY (COL1, COL2), does that mean that all columns are stored sorted by COL1, then COL2? Or maybe, since it is a columnar store, each column gets stored in a different order? I.e. COL1 in COL1 order, COL2 in COL2 order, and the other columns unordered?

My situation is that I have a table with (among others) a type_id and a timestamp column. Data arrives roughly in timestamp order. Most queries are joined against / restricted by both type_id and timestamp. Usually the type_id clauses are more specific, meaning a much larger percentage of rows can be excluded by looking at the type_id clause than by looking at the timestamp clause. type_id is the DISTKEY for this reason. I'm trying to understand the pros and cons of SORTKEY (type_id), SORTKEY (stamp), SORTKEY (type_id,stamp), SORTKEY (stamp,type_id).

Thanks.

Answer

Enno Shioji picture Enno Shioji · Jul 7, 2013

If you declare SORTKEY(COL1, COL2), all columns will be sorted by COL1, then COL2 as if ORDER BY (COL1, COL2) was done.

If you are using SORTKEY to speed up a JOIN, AFAIU it doesn't matter so long as you use the same SORTKEY on the tables that will be joined because what happens is a merge join.

If COL1 is highly selective like your type_id, it means there are only small numbers of rows which has the same type_id. Therefore although you can add another column to SORTKEY, its utility is limited because most of the row elimination has already happened.

If COL1 is not highly selective like your stamp (which is a bit weird btw; I would have expected it to be more selective than type_id? Anyways..), it means that filtering by stamp won't eliminate that much rows. So it makes more sense to declare a second sort key. However, this is less efficient than the other way around as eliminating rows earlier would be cheaper. If you sometimes filter by stamp but not by type_id, it may make sense to do this though.