Why does Redshift not need materialized views or indexes?

m0meni picture m0meni · May 31, 2016 · Viewed 10.3k times · Source

In the Redshift FAQ under

Q: How does the performance of Amazon Redshift compare to most traditional databases for data warehousing and analytics?

It says the following:

Advanced Compression: Columnar data stores can be compressed much more than row-based data stores because similar data is stored sequentially on disk. Amazon Redshift employs multiple compression techniques and can often achieve significant compression relative to traditional relational data stores. In addition, Amazon Redshift doesn't require indexes or materialized views and so uses less space than traditional relational database systems. When loading data into an empty table, Amazon Redshift automatically samples your data and selects the most appropriate compression scheme.

Why is this the case?

Answer

MatBailie picture MatBailie · May 31, 2016

It's a bit disingenuous to be honest (in my opinion). Although RedShift has neither of these, I'm not sure that's the same as saying it wouldn't benefit from them.

Materialised Views

I have no real idea why they make this claim. Possibly because they consider the engine so performant that the gains from having them are minimal.

I would dispute this and the product I work on maintains its own materialised views and can show significant performance gains from doing so. Perhaps AWS believe I must be doing something wrong in the first place?

Indexes

RedShift does not have indexes.

It does have SORT ORDER which is exceptionally similar to a clustered index. It is simply a list of fields by which the data is ordered (like a composite clustered index).

It even has recently introduced INTERLEAVED SORT KEYS. This is a direct attempt to have multiple independent sort orders. Instead of ordering by a THEN b THEN c it effectively orders by each of them at the same time.

That becomes kind of possible because of how RedShift implements its column store.
- Each column is stored separately from each other column
- Each column is stored in 1MB blocks
- Each 1MB block has summary statistics

As well as being the storage pattern this effectively becomes a set of pseudo indexes.
- If the data is sorted by a then b then x
- But you want z = 1234
- RedShift looks at the block statistics (for column z) first
- Those stats will say the minimum and maximum values stored by that block
- This allows Redshift to skip many of those blocks in certain conditions
- This intern allows RedShift to identify which blocks to read from the other columns