Index in Parquet

Sjoerd van Hagen · Nov 13, 2014 · Viewed 19.9k times

I would like to run fast range queries on a Parquet table. The amount of data returned is very small compared to the total size, but because a full column scan has to be performed, the query is too slow for my use case.

Using an index would solve this problem and I read that this was to be added in Parquet 2.0. However, I cannot find any other information on this so I am guessing that it was not. I do not think that there would be any fundamental obstacles preventing the addition of (multi-column) indexes, if the data were sorted, which in my case it is.

My question is: when will indexes be added to Parquet, and what would the high-level design be? I think I would already be happy with an index that points to the correct partition.

Kind regards,

Sjoerd.

Answer

blue · May 22, 2015

Parquet currently keeps min/max statistics for each data page. A data page is a group of ~1MB of values (after encoding) for a single column; multiple pages are what make up Parquet's column chunks.

Those min/max values are used to filter both column chunks and the pages that make up a chunk. So you should be able to improve your query time by sorting records by the columns you want to filter on, then writing the data into Parquet. That way, you get the most out of the stats filtering.
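To illustrate why sorting helps, here is a minimal toy model of min/max filtering in plain Python. It is not Parquet's actual implementation: the helper names `build_stats` and `range_query` are made up for this sketch, and each "group" simply stands in for a row group or page that carries its own min/max statistics.

```python
import random

def build_stats(values, group_size):
    """Split values into fixed-size groups and record (min, max, rows) per group."""
    groups = [values[i:i + group_size] for i in range(0, len(values), group_size)]
    return [(min(g), max(g), g) for g in groups]

def range_query(groups, lo, hi):
    """Return the matching values and the number of groups that had to be scanned."""
    hits, scanned = [], 0
    for g_min, g_max, rows in groups:
        if g_max < lo or g_min > hi:
            continue  # the stats prove that no row in this group can match
        scanned += 1
        hits.extend(v for v in rows if lo <= v <= hi)
    return hits, scanned

random.seed(0)
data = random.sample(range(100_000), 10_000)  # hypothetical column values

# Same data, same query; only the write order differs.
unsorted_hits, unsorted_scanned = range_query(build_stats(data, 1_000), 5_000, 5_500)
sorted_hits, sorted_scanned = range_query(build_stats(sorted(data), 1_000), 5_000, 5_500)

print("groups scanned (unsorted):", unsorted_scanned)
print("groups scanned (sorted):  ", sorted_scanned)
```

With unsorted data nearly every group's min/max range spans the whole value domain, so the stats rule out almost nothing. After sorting, each group covers a narrow, non-overlapping range and most groups can be skipped without being read.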

You can also get more granular filtering with this technique by decreasing the page and row group sizes, though you then give up some encoding efficiency and I/O efficiency in exchange.
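The granularity trade-off can be sketched with another toy model in plain Python. Again, this is an assumption-laden illustration rather than Parquet code: `scan_cost` is a hypothetical helper, "rows read" stands in for I/O, and "stats entries" stands in for the metadata and per-group overhead that grows as groups shrink.

```python
def scan_cost(sorted_values, group_size, lo, hi):
    """Return (rows_read, num_groups) for a range query over sorted values."""
    rows_read = 0
    num_groups = 0
    for start in range(0, len(sorted_values), group_size):
        group = sorted_values[start:start + group_size]
        num_groups += 1
        # The data is sorted, so the group's min is its first value and max its last.
        if group[-1] < lo or group[0] > hi:
            continue  # skipped purely on min/max
        rows_read += len(group)
    return rows_read, num_groups

values = list(range(0, 100_000, 10))  # 10,000 sorted values: 0, 10, 20, ...

for size in (1_000, 100, 10):
    rows_read, num_groups = scan_cost(values, size, 5_000, 5_500)
    print(f"group_size={size:>5}: rows read={rows_read:>5}, stats entries={num_groups}")
```

For this query, which matches 51 values, the rows read drop from 1000 to 100 to 60 as the group size shrinks, while the number of stats entries grows from 10 to 100 to 1000. In a real writer the corresponding knobs are the row group and page size settings; for example, `pyarrow.parquet.write_table` accepts `row_group_size` and `data_page_size` arguments.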