How to overwrite the output directory in Spark

Vijay Innamuri · Nov 20, 2014 · Viewed 189.8k times

I have a Spark Streaming application which produces a dataset every minute. I need to save/overwrite the results of the processed data.

When I try to overwrite the dataset, an org.apache.hadoop.mapred.FileAlreadyExistsException stops the execution.

I set the Spark property set("spark.files.overwrite", "true"), but had no luck.

How can I overwrite or pre-delete the files from Spark?

Answer

samthebest · Nov 28, 2014

UPDATE: I suggest using DataFrames, plus something like ... .write.mode(SaveMode.Overwrite) ....
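For example, a minimal sketch of the DataFrame route (the output path and sample data are illustrative):

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// SaveMode.Overwrite replaces whatever already exists at the target path
Seq("a", "b", "c").toDF("value")
  .write
  .mode(SaveMode.Overwrite)
  .text("/tmp/output")  // illustrative path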

Handy pimp:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{SaveMode, SparkSession}

implicit class PimpedStringRDD(rdd: RDD[String]) {
  // Write the RDD as plain text, overwriting anything already at path p
  def write(p: String)(implicit ss: SparkSession): Unit = {
    import ss.implicits._
    rdd.toDF().as[String].write.mode(SaveMode.Overwrite).text(p)
  }
}
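Usage, assuming an implicit SparkSession is in scope (names and path here are illustrative):

implicit val ss: SparkSession = SparkSession.builder().getOrCreate()

ss.sparkContext
  .parallelize(Seq("line1", "line2"))
  .write("/tmp/output")  // overwrites any previous run's output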

For older versions, try:

// Disable Hadoop's check that the output directory does not already exist
yourSparkConf.set("spark.hadoop.validateOutputSpecs", "false")
val sc = new SparkContext(yourSparkConf)

In Spark 1.1.0 you can set conf settings via the spark-submit script with the --conf flag.
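For example (the conf key is the one from the snippet above; the class and jar names are placeholders):

spark-submit --conf spark.hadoop.validateOutputSpecs=false \
  --class com.example.StreamingApp streaming-app.jar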

WARNING (older versions): According to @piggybox there is a bug in Spark where it will only overwrite the files it needs in order to write its part- files; any other files will be left unremoved.
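If that bites you, one workaround (not part of the original answer; a sketch using Hadoop's FileSystem API, with an illustrative path) is to delete the output directory yourself before each write:

import org.apache.hadoop.fs.{FileSystem, Path}

val outputPath = new Path("/tmp/output")         // illustrative path
val fs = FileSystem.get(sc.hadoopConfiguration)  // sc is your SparkContext
if (fs.exists(outputPath)) fs.delete(outputPath, true)  // true = recursive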