I understand HDFS will split files into chunks of something like 64 MB. We have data coming in as a stream, and we can store it in large files or medium-sized files. What is the optimal size for columnar file storage? If I can store files where the smallest column is 64 MB, would it save any computation time over having, say, 1 GB files?
Aim for around 1 GB per file (i.e. per Spark partition) (1).
Ideally, you would use Snappy compression (the default), since Snappy-compressed Parquet files are splittable (2).
Using Snappy instead of gzip will significantly increase the file size, so if storage space is an issue, that needs to be taken into account.
.option("compression", "gzip")
is the option to override the default snappy compression.
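As a minimal sketch, here is what writing a DataFrame to Parquet with the default Snappy codec and with the gzip override might look like; the SparkSession setup and the input/output paths are placeholders, not part of the original answer:

import org.apache.spark.sql.SparkSession

object ParquetCompressionExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-compression-example")
      .getOrCreate()

    // Hypothetical input path, for illustration only.
    val df = spark.read.parquet("/data/in")

    // Snappy is the default Parquet codec, so no option is needed here.
    df.write.parquet("/data/out_snappy")

    // Override the default with gzip when storage space matters more
    // than write/read CPU cost.
    df.write
      .option("compression", "gzip")
      .parquet("/data/out_gzip")

    spark.stop()
  }
}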
If you need to resize/repartition your Dataset/DataFrame/RDD, call .coalesce(<num_partitions>)
or, in the worst case, .repartition(<num_partitions>).
Warning: repartition in particular triggers a full shuffle, and even coalesce can move data around, so use both with some caution.
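Continuing the sketch above (reusing df from it), one common pattern is to estimate the number of output files from the dataset size and the ~1 GB target, then coalesce before writing; the dataset size here is a made-up placeholder:

// Assumed target of ~1 GB per output file and a rough, hand-estimated
// dataset size; both numbers are illustrative placeholders.
val targetFileBytes = 1024L * 1024 * 1024            // ~1 GB per file
val estimatedDatasetBytes = 50L * 1024 * 1024 * 1024 // e.g. ~50 GB of input
val numFiles = math.max(1, (estimatedDatasetBytes / targetFileBytes).toInt)

// coalesce only reduces the partition count and avoids a full shuffle;
// repartition can grow or shrink the count but always shuffles.
df.coalesce(numFiles)
  .write
  .parquet("/data/out_sized")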
Also, Parquet files (and, for that matter, files in general) should be larger than the HDFS block size (128 MB by default).
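If you want to check the configured block size from inside a Spark job, the Hadoop configuration is reachable through the SparkContext; this is just a sketch and assumes an existing spark session:

// dfs.blocksize is the HDFS block size; fall back to 128 MB if unset.
// getLongBytes also accepts size-suffixed values such as "128m".
val blockSizeBytes = spark.sparkContext.hadoopConfiguration
  .getLongBytes("dfs.blocksize", 128L * 1024 * 1024)
println(s"HDFS block size: ${blockSizeBytes / (1024 * 1024)} MB")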
(1) https://forums.databricks.com/questions/101/what-is-an-optimal-size-for-file-partitions-using.html
(2) http://boristyukin.com/is-snappy-compressed-parquet-file-splittable/