I have an application that copies over a large number of files from a source such as S3 into HDFS. The application uses Apache DistCp internally and streams each individual file from the source into HDFS.
Each individual file is around 1 GB and has 1K columns of strings. When I choose to copy over all the columns, the write fails with the following error:
2014-05-20 23:57:35,939 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block null bad datanode[0] nodes == null
2014-05-20 23:57:35,939 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/xyz/2014/01/02/control-Jan-2014-14.gz" - Aborting...
2014-05-20 23:57:54,369 ERROR abc.mapred.distcp.DistcpRunnable: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /xyz/2014/01/02/control-Jan-2014-14.gz File does not exist. [Lease. Holder: DFSClient_attempt_201403272055_15994_m_000004_0_-1476619343_1, pendingcreates: 4]
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1720)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1711)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1619)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:736)
at sun.reflect.GeneratedMethodAccessor41.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:578)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
I believe this is because writing one large file from the source into HDFS takes too much time. When I modify the application to copy over only 50, 100, or 200 columns, it runs to completion. The application fails when the number of columns being copied per row exceeds 200.
I have no control over the source files.
I cannot seem to find anything about increasing the lease expiration.
Any pointers?
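For context, each mapper's per-file write is roughly equivalent to the following simplified sketch (class and path names here are illustrative, not our actual code):

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class StreamingCopy {
    // Streams one source file (e.g. an InputStream opened from S3) into an HDFS file.
    public static void copyToHdfs(InputStream source, String hdfsPath, Configuration conf)
            throws Exception {
        FileSystem fs = FileSystem.get(conf);
        // create() acquires a lease on hdfsPath for this client
        FSDataOutputStream out = fs.create(new Path(hdfsPath), true);
        try {
            IOUtils.copyBytes(source, out, conf, false);
        } finally {
            out.close();   // the lease is released when the stream is closed
            source.close();
        }
    }
}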
So finally I was able to determine what was going on. From the source, S3, our application was downloading files like
/xyz/2014/01/week1/abc
/xyz/2014/01/week1/def
/xyz/2014/01/week2/abc
/xyz/2014/01/week2/def
/xyz/2014/01/week3/abc
/xyz/2014/01/week3/def
Notice the same file names across different weeks. Each of these files was then being written to HDFS using the DFSClient. So essentially multiple mappers were trying to write the "same file" (because of identical file names like abc and def) even though the files were actually different. Since a client has to acquire a lease before writing a file, and the client writing the first "abc" file was not releasing its lease during the write, the other client trying to write its own "abc" file got the LeaseExpiredException with a lease mismatch message.
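One way to avoid the collision is to make the HDFS destination path unique per source file, for example by keeping the week subdirectory instead of flattening everything to the bare file name. A rough, illustrative sketch (not our actual code):

import org.apache.hadoop.fs.Path;

public class DestPaths {
    // Build an HDFS destination that preserves the distinguishing part of the
    // source key (e.g. ".../week2/abc") rather than just the bare file name,
    // so two mappers never create() the same HDFS path concurrently.
    public static Path toHdfsPath(String hdfsRoot, String s3Key) {
        // s3Key e.g. "xyz/2014/01/week2/abc"
        String[] parts = s3Key.split("/");
        String week = parts[parts.length - 2];   // "week2"
        String name = parts[parts.length - 1];   // "abc"
        return new Path(hdfsRoot, week + "-" + name);   // e.g. /dest/week2-abc
    }
}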
But this still does not explain why the client that first acquired the lease for the write did not succeed. I would expect the first writer of each such file to succeed. Any explanation?