I am trying to reproduce an Amazon EMR cluster on my local machine. For that purpose, I have installed the latest stable version of Hadoop as of now - 2.6.0. Now I would like to access an S3 bucket, as I do inside the EMR cluster.
I have added the aws credentials in core-site.xml:
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>some id</value>
</property>
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>some id</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>some key</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>some key</value>
</property>
Note: Since there are some slashes on the key, I have escaped them with %2F
If I try to list the contents of the bucket:
hadoop fs -ls s3://some-url/bucket/
I get this error:
ls: No FileSystem for scheme: s3
I edited core-site.xml again, and added information related to the fs:
<property>
<name>fs.s3.impl</name>
<value>org.apache.hadoop.fs.s3.S3FileSystem</value>
</property>
<property>
<name>fs.s3n.impl</name>
<value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
</property>
This time I get a different error:
-ls: Fatal internal error
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3.S3FileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2578)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
Somehow I suspect the Yarn distribution does not have the necessary jars to be able to read S3, but I have no idea where to get those. Any pointers in this direction would be greatly appreciated.
For some reason, the jar hadoop-aws-[version].jar
which contains the implementation to NativeS3FileSystem
is not present in the classpath
of hadoop by default in the version 2.6 & 2.7. So, try and add it to the classpath by adding the following line in hadoop-env.sh
which is located in $HADOOP_HOME/etc/hadoop/hadoop-env.sh
:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*
Assuming you are using Apache Hadoop 2.6 or 2.7
By the way, you could check the classpath of Hadoop using:
bin/hadoop classpath