Overwrite parquet files from dynamic frame in AWS Glue

Mateo Rod · Aug 24, 2018

I use dynamic frames to write Parquet files to S3, but if a file already exists my program appends a new file instead of replacing it. The statement I use is:

glueContext.write_dynamic_frame.from_options(frame = table,
                                         connection_type = "s3",
                                         connection_options = {"path": output_dir,
                                                               "partitionKeys": ["var1","var2"]},
                                         format = "parquet")

Is there anything like "mode":"overwrite" that replaces my Parquet files?

Answer

Yuriy Bondaruk · Aug 25, 2018

Currently, AWS Glue doesn't support an 'overwrite' mode, but AWS is working on this feature.

As a workaround, you can convert the DynamicFrame object to a Spark DataFrame and write it with Spark instead of Glue:

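# Convert the DynamicFrame to a Spark DataFrame and overwrite existing output.
# The parentheses let the method chain span several lines in Python.
(table.toDF()
    .write
    .mode("overwrite")
    .format("parquet")
    .partitionBy("var1", "var2")
    .save(output_dir))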
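
If the same write is needed in several places, the workaround can be wrapped in a small helper. This is only a sketch: the function name overwrite_parquet and its parameters are hypothetical, and it assumes the object passed in is a Glue DynamicFrame (anything with a toDF() method that returns a Spark DataFrame).

```python
def overwrite_parquet(dynamic_frame, output_dir, partition_keys=None):
    """Write a DynamicFrame as Parquet, replacing existing output.

    Hypothetical helper: it goes through Spark's DataFrameWriter,
    because the Glue writer in the question has no overwrite option.
    """
    # DynamicFrame.toDF() returns a regular Spark DataFrame.
    writer = dynamic_frame.toDF().write.mode("overwrite").format("parquet")

    # Partition the output only when partition columns were given.
    if partition_keys:
        writer = writer.partitionBy(*partition_keys)

    writer.save(output_dir)
```

Called as overwrite_parquet(table, output_dir, ["var1", "var2"]), it performs the same write as the chained statement above. Note that with Spark's default settings, overwrite mode replaces everything under output_dir, not only the partitions present in the new data.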