EMRFS file sync with S3 not working

amazon-s3 pyspark amazon-emr

sakurashinken · Oct 3, 2016 · Viewed 13.5k times · Source

After running a Spark job on an Amazon EMR cluster, I deleted the output files directly from S3 and tried to rerun the job. When it tried to write Parquet output to S3 using sqlContext.write, I received the following error:

'bucket/folder' present in the metadata but not s3
at com.amazon.ws.emr.hadoop.fs.consistency.ConsistencyCheckerS3FileSystem.getFileStatus(ConsistencyCheckerS3FileSystem.java:455)

I tried running

emrfs sync s3://bucket/folder

which did not appear to resolve the error, even though it did remove some records from the DynamoDB table that keeps track of the EMRFS metadata. I'm not sure what else I can try. How do I resolve this error?

Answer

It turned out that I needed to run

emrfs delete s3://bucket/folder

before running sync. Doing so solved the issue.
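
For anyone who wants to script this, here is a minimal sketch of the delete-then-sync order described above. It assumes the emrfs CLI is on the PATH (as it normally is on the EMR master node); the helper names emrfs_resync_commands and resync and the dry_run flag are my own, not part of EMRFS. By default it only prints the commands, so nothing is touched until you pass dry_run=False.

```python
import subprocess


def emrfs_resync_commands(s3_path):
    """Return the two EMRFS CLI calls in the order described in the answer."""
    return [
        ["emrfs", "delete", s3_path],  # remove the metadata entries for the path
        ["emrfs", "sync", s3_path],    # then sync the metadata with what is in S3
    ]


def resync(s3_path, dry_run=True):
    """Print the commands (dry_run=True) or run them on the cluster."""
    for cmd in emrfs_resync_commands(s3_path):
        if dry_run:
            print(" ".join(cmd))
        else:
            # Assumption: the emrfs CLI is installed and on PATH.
            # check=True stops at the first command that fails.
            subprocess.run(cmd, check=True)
```

Calling resync("s3://bucket/folder") prints the two commands in order; resync("s3://bucket/folder", dry_run=False) executes them.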
