AWS Glue: How to add a column with the source filename in the output?

markwatson · May 10, 2018 · Viewed 7.4k times

Does anyone know of a way to add the source filename as a column in a Glue job?

We created a flow where we crawled some files in S3 to create a schema. We then wrote a job that transforms the files to a new format and writes those files back to another S3 bucket as CSV, to be used by the rest of our pipeline. What we would like is access to some sort of job metadata so we can add a new column to the output file that contains the original filename.

I looked through the AWS documentation and the aws-glue-libs source, but didn't see anything that jumped out. Ideally there would be some way to get metadata from the awsglue.job package (we're using the python flavor).

I'm still learning Glue, so apologies if I'm using the wrong terminology. I tagged this with the spark tag as well, because I believe that's what Glue is using under the covers.

Answer

Yuriy Bondaruk · May 11, 2018

You can do it with Spark in your ETL job by converting the DynamicFrame to a DataFrame and adding a column with input_file_name():

// Imports needed for the snippet below
import com.amazonaws.services.glue.DynamicFrame
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.sql.functions.input_file_name

// Read the source table from the Data Catalog, convert it to a DataFrame,
// and add a column containing the path of the file each row came from.
val df = glueContext.getCatalogSource(
  database = database,
  tableName = table,
  transformationContext = s"source-$database.$table"
).getDynamicFrame()
 .toDF()
 .withColumn("input_file_name", input_file_name())

// Write the result back to S3 as Parquet.
glueContext.getSinkWithFormat(
  connectionType = "s3",
  options = JsonOptions(Map(
    "path" -> args("DST_S3_PATH")
  )),
  transformationContext = "",
  format = "parquet"
).writeDynamicFrame(DynamicFrame(df, glueContext))

Remember that this works only with the getCatalogSource() API, not with create_dynamic_frame_from_options().
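
Since the question mentions the Python flavor of Glue, here is a minimal PySpark sketch of the same idea, assuming a Data Catalog source, a DST_S3_PATH job argument, and CSV output; the database and table names below are placeholders, not anything from the original post:

import sys
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import input_file_name

args = getResolvedOptions(sys.argv, ["JOB_NAME", "DST_S3_PATH"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from the Data Catalog, convert to a Spark DataFrame,
# and append the source file path as a new column.
df = (glue_context.create_dynamic_frame.from_catalog(
          database="my_database",    # placeholder
          table_name="my_table",     # placeholder
          transformation_ctx="source")
      .toDF()
      .withColumn("input_file_name", input_file_name()))

# Convert back to a DynamicFrame and write out as CSV.
glue_context.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(df, glue_context, "with_filename"),
    connection_type="s3",
    connection_options={"path": args["DST_S3_PATH"]},
    format="csv")

job.commit()

As with the Scala version, this relies on reading through the Data Catalog (create_dynamic_frame.from_catalog, the Python counterpart of getCatalogSource).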