How to read a ".gz" compressed file using a Spark DF or DS?

prady · Mar 26, 2018 · Viewed 14.4k times

I have a compressed file in .gz format. Is it possible to read the file directly using a Spark DataFrame/Dataset (DF/DS)?

Details: the file is a CSV with tab-delimited fields.

Answer

Shaido · Mar 27, 2018

Reading a compressed CSV is done in the same way as reading an uncompressed one: Spark detects the compression codec from the .gz extension and decompresses the file automatically. For Spark 2.0+ it can be done as follows in Scala (note the extra option for the tab delimiter):

val df = spark.read.option("sep", "\t").csv("file.csv.gz")

PySpark:

df = spark.read.csv("file.csv.gz", sep='\t')

The only extra consideration is that a gzip file is not splittable, so Spark must read the entire file on a single core, which slows down the initial read. Once the read is done, the data can be repartitioned to increase parallelism.
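Outside Spark, the same layout can be sanity-checked with Python's standard library. The sketch below (reusing the hypothetical `file.csv.gz` name from the examples above) also illustrates why the read is single-threaded: a gzip stream has to be decompressed sequentially from the front.

```python
import csv
import gzip


def read_tsv_gz(path):
    """Read a gzip-compressed, tab-delimited CSV into a list of rows.

    Mirrors what Spark does for a .gz input: the whole stream is
    decompressed sequentially (gzip is not splittable), then each
    line is split on the tab delimiter.
    """
    # mode="rt" decompresses and decodes to text in one step.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as f:
        return list(csv.reader(f, delimiter="\t"))
```

This is only a local sanity check for small files; for anything sizable, the Spark readers shown above are the right tool.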