distcp from Hadoop to S3 fails with "No space available in any of the local directories"

Zack · Jun 16, 2016 · Viewed 7k times

I'm trying to copy data from a local Hadoop cluster to an S3 bucket using distcp.

Sometimes it "works", but some of the mappers fail with the stack trace below. Other times, so many mappers fail that the whole job is killed.

The error "No space available in any of the local directories" doesn't make sense to me. There is PLENTY of space on the edge node (where the distcp command is running), on the cluster, and in the S3 bucket.

Can anyone shed some light on this?

16/06/16 15:48:08 INFO mapreduce.Job: The url to track the job: <url>
16/06/16 15:48:08 INFO tools.DistCp: DistCp job-id: job_1465943812607_0208
16/06/16 15:48:08 INFO mapreduce.Job: Running job: job_1465943812607_0208
16/06/16 15:48:16 INFO mapreduce.Job: Job job_1465943812607_0208 running in uber mode : false
16/06/16 15:48:16 INFO mapreduce.Job:  map 0% reduce 0%
16/06/16 15:48:23 INFO mapreduce.Job:  map 33% reduce 0%
16/06/16 15:48:26 INFO mapreduce.Job: Task Id : attempt_1465943812607_0208_m_000001_0, Status : FAILED
Error: java.io.IOException: File copy failed: hdfs://<hdfs path>/000000_0 --> s3n://<bucket>/<s3 path>/000000_0
        at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:285)
        at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:253)
        at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:50)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.io.IOException: Couldn't run retriable-command: Copying hdfs://<hdfs path>/000000_0 to s3n://<bucket>/<s3 path>/000000_0
        at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:101)
        at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:281)
        ... 10 more
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: No space available in any of the local directories.
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:366)
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.createTmpFileForWrite(LocalDirAllocator.java:416)
        at org.apache.hadoop.fs.LocalDirAllocator.createTmpFileForWrite(LocalDirAllocator.java:198)
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsOutputStream.newBackupFile(NativeS3FileSystem.java:263)
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsOutputStream.<init>(NativeS3FileSystem.java:245)
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem.create(NativeS3FileSystem.java:412)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:986)
        at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyToFile(RetriableFileCopyCommand.java:174)
        at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:123)
        at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:99)
        at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)
        ... 11 more

Answer

Noam Barkai · Jul 4, 2016

We ran into the same exception while trying to save the results of an Apache Spark (version 1.5.2) run directly to S3. I'm not really sure what the core issue is, but the stack trace shows that NativeS3FsOutputStream first buffers each file to a local temp file via Hadoop's LocalDirAllocator class (version 2.7) before uploading it, and it's that local allocation that fails. So the "local directories" the error complains about are the task's local dirs on whichever worker node runs the mapper, not the edge node. Somehow the S3 upload doesn't seem to "play nice" with LocalDirAllocator.

What finally solved it for us was a combination of:

  1. enabling S3A's "fast upload" by setting "fs.s3a.fast.upload" to "true" in the Hadoop configuration. This uses S3AFastOutputStream instead of S3AOutputStream, which uploads data directly from memory instead of first buffering it to local disk (see the sketch after this list)

  2. merging the results of the job into a single part before saving to S3 (in Spark, via repartition/coalesce)

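A minimal sketch of both steps in Spark/Scala; the input path, bucket name, and app name below are placeholders, and it assumes the S3A connector and credentials are already configured:

    import org.apache.spark.{SparkConf, SparkContext}

    object S3FastUploadExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("s3a-fast-upload"))

        // (1) Write through S3AFastOutputStream, which buffers uploads in
        // memory instead of asking LocalDirAllocator for a local temp file.
        sc.hadoopConfiguration.set("fs.s3a.fast.upload", "true")

        // Placeholder input; stands in for whatever the job actually computes.
        val results = sc.textFile("hdfs:///input/path")

        // (2) Merge the output into a single partition before saving,
        // so only one part file is uploaded to S3.
        results.coalesce(1).saveAsTextFile("s3a://my-bucket/output/path")

        sc.stop()
      }
    }
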
Some caveats though:

  1. S3A's fast upload is apparently marked "experimental" in Hadoop 2.7

  2. this workaround only applies to the newer s3a filesystem ("s3a://..."); it won't work with the older "native" s3n filesystem ("s3n://...")

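For the original distcp job, the part of this that carries over is switching the destination scheme from s3n:// to s3a:// and enabling fast upload (the coalescing step is Spark-specific). A rough sketch, reusing the question's placeholder paths; since DistCp runs through ToolRunner, the property can be passed per-invocation with a generic -D option instead of editing core-site.xml:

    hadoop distcp \
      -D fs.s3a.fast.upload=true \
      hdfs://<hdfs path>/ \
      s3a://<bucket>/<s3 path>/
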
Hope this helps.