Overwrite specific partitions in spark dataframe write method

Question 1

apache-spark apache-spark-sql spark-dataframe

yatin · Jul 20, 2016

Answer

Finally! This is now a feature in Spark 2.3.0: https://issues.apache.org/jira/browse/SPARK-20236

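To use it, three things are needed: the spark.sql.sources.partitionOverwriteMode setting must be set to dynamic, the dataset must be partitioned, and the write mode must be overwrite. Only the partitions that receive data from the write are then replaced; the others are left untouched. Example: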

spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
data.write.mode("overwrite").insertInto("partitioned_table")
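
For a path-based write like the one in the question, the same setting should apply. Here is a minimal sketch, assuming Spark 2.3.0 or later. The helper name overwrite_present_partitions is made up for illustration, and spark and df stand for an existing SparkSession and DataFrame:

```python
def overwrite_present_partitions(spark, df, path, partition_col):
    """Overwrite only the partition folders under `path` that have rows in `df`.

    Hypothetical helper: `spark` is a SparkSession, `df` a DataFrame.
    """
    # With the default "static" mode, every partition under `path`
    # is deleted before the write; "dynamic" keeps the untouched ones.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    (
        df.write
        .mode("overwrite")
        .partitionBy(partition_col)
        .orc(path)
    )

# Usage with the names from the question:
# overwrite_present_partitions(spark, df, "maprfs:///hdfs-base-path", "col4")
```

Try it on a copy of the data first, since a wrong setting wipes every partition under the path.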

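I recommend repartitioning by your partition column before writing. Each write task creates its own file in every partition folder it has rows for, so without the repartition you can end up with one file per task (for example, 400 files) in each folder.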
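
To see why the repartition helps, here is a toy model in plain Python (no Spark involved). It assumes only that each write task produces one file per partition value it holds, and ignores file-size limits and other settings. The function names are made up for illustration:

```python
from collections import Counter

def files_per_folder(tasks):
    """Count output files per partition folder.

    `tasks` holds one list of partition-column values per write task;
    each task contributes one file to every folder it has rows for.
    """
    counts = Counter()
    for rows in tasks:
        counts.update(set(rows))
    return dict(counts)

def repartition_by_value(tasks, num_tasks):
    """Mimic a repartition on the column: equal values go to the same task."""
    out = [[] for _ in range(num_tasks)]
    for rows in tasks:
        for value in rows:
            out[hash(value) % num_tasks].append(value)
    return out

# Four tasks, each holding rows for both partitions "a" and "b".
scattered = [["a", "b"], ["a", "b"], ["b", "a"], ["a", "b"]]
print(files_per_folder(scattered))
print(files_per_folder(repartition_by_value(scattered, 4)))
```

In Spark itself the equivalent step would be something like data.repartition("col4") before the write, with col4 being the partition column from the question.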

Before Spark 2.3.0, the best solution was to run SQL statements to delete the affected partitions and then write the new data with mode append.
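
A rough sketch of that older approach, assuming the data lives in a metastore table (partitioned_table and col4 are placeholders taken from the examples here, and drop_partition_statements is a hypothetical helper). It only builds the SQL strings; running them is shown in the comments:

```python
def drop_partition_statements(table, partition_col, values):
    """Build one ALTER TABLE ... DROP IF EXISTS PARTITION statement per value.

    Assumes trusted values: strings are quoted as-is (no escaping),
    everything else is written bare.
    """
    statements = []
    for value in values:
        literal = f"'{value}'" if isinstance(value, str) else str(value)
        statements.append(
            f"ALTER TABLE {table} DROP IF EXISTS PARTITION ({partition_col}={literal})"
        )
    return statements

# Usage, assuming `spark` and `df` already exist:
# values = [row[0] for row in df.select("col4").distinct().collect()]
# for stmt in drop_partition_statements("partitioned_table", "col4", values):
#     spark.sql(stmt)
# df.write.mode("append").insertInto("partitioned_table")
```

Note that for an external table, dropping a partition may leave the files in place, so check the folder before appending.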

Question 2

I want to overwrite specific partitions instead of all of them in Spark. I am trying the following command:

df.write.orc('maprfs:///hdfs-base-path','overwrite',partitionBy='col4')

where df is a DataFrame containing the incremental data to be overwritten.

hdfs-base-path contains the master data.

When I try the above command, it deletes all the partitions and inserts only those present in df at the HDFS path.

My requirement is to overwrite only those partitions present in df at the specified HDFS path. Can someone please help me with this?
