What is the benefit of using nested data types in Parquet?

apache-spark nested parquet data-files

user976850 · Mar 25, 2018 · Viewed 7.9k times · Source

Is there any performance benefit to using nested data types in the Parquet file format?

AFAIK Parquet files are usually created specifically for query services such as Athena, so the process that creates them might as well flatten the values. That would allow easier querying and a simpler schema, and would retain the column statistics for each column.
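To make "flatten the values" concrete, here is a minimal sketch of what such a writer process might do to each record before writing it out. The function name `flatten_record`, the `_` separator, and the sample record are all hypothetical, and it only handles nested dicts (structs), not lists or maps.

```python
from typing import Any, Dict


def flatten_record(record: Dict[str, Any], parent: str = "", sep: str = "_") -> Dict[str, Any]:
    """Turn nested dicts into a single-level dict with joined column names."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        # Prefix the key with the parent path, e.g. "address" + "_" + "city".
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into the struct and merge its flattened fields.
            flat.update(flatten_record(value, name, sep))
        else:
            flat[name] = value
    return flat


# Hypothetical sample record with a two-level struct.
nested = {"id": 1, "address": {"city": "Oslo", "geo": {"lat": 59.9, "lon": 10.7}}}
flat = flatten_record(nested)
print(flat)
```

Each leaf then becomes its own top-level column (`address_city`, `address_geo_lat`, ...), which is the layout the question assumes a query service would prefer.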

What benefit is there to be gained by using nested data types e.g. struct?

Answer

There is a negative consequence to keeping a nested structure in Parquet: Spark predicate pushdown doesn't work properly if the Parquet file contains nested structures.

So even if you are working with only a few fields in your Parquet dataset, Spark will load and materialize the entire dataset.

Here is the ticket for this issue, which has been open for a long time.

EDIT

The issue has been resolved in Spark 2.4.
