Locally reading S3 files through Spark (or better: pyspark)

Eric O Lebigot · Apr 4, 2015 · Viewed 33k times

I want to read an S3 file from my (local) machine, through Spark (pyspark, really). Now, I keep getting authentication errors like

java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).

I looked everywhere here and on the web and tried many things, but apparently S3 access has been changing over the last year or so, and every method failed except one:

pyspark.SparkContext().textFile("s3n://user:password@bucket/key")

(note the s3n scheme; plain s3 did not work). Now, I don't want to put the user and password in the URL, because they can appear in logs, and I am also not sure how to get them from the ~/.aws/credentials file anyway.

So, how can I read locally from S3 through Spark (or, better, pyspark) using the AWS credentials from the now standard ~/.aws/credentials file (ideally, without copying the credentials there to yet another configuration file)?

PS: I tried os.environ["AWS_ACCESS_KEY_ID"] = … and os.environ["AWS_SECRET_ACCESS_KEY"] = …, but it did not work.

PPS: I am not sure where to "set the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties" (Google did not come up with anything). However, I did try many ways of setting these: SparkContext.setSystemProperty(), sc.setLocalProperty(), and conf = SparkConf(); conf.set(…); conf.set(…); sc = SparkContext(conf=conf). Nothing worked.
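
For reference, the last of those attempts looked roughly like this (the key values here are just placeholders):

from pyspark import SparkConf, SparkContext

conf = SparkConf()
conf.set('fs.s3n.awsAccessKeyId', '<access key id>')           # placeholder
conf.set('fs.s3n.awsSecretAccessKey', '<secret access key>')   # placeholder
sc = SparkContext(conf=conf)
# sc.textFile("s3n://bucket/key") still raises the IllegalArgumentException above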

Answer

Daniel Darabos · Apr 4, 2015

Yes, you have to use s3n instead of s3: s3 is Hadoop's older block-based storage layer on top of S3 (it stores blocks rather than plain objects), and its benefits are unclear to me; s3n reads ordinary S3 objects.

You can pass the credentials to the sc.hadoopFile or sc.newAPIHadoopFile calls (these also require the input format, key and value classes):

rdd = sc.hadoopFile('s3n://my_bucket/my_file',
                    'org.apache.hadoop.mapred.TextInputFormat',  # input format class
                    'org.apache.hadoop.io.LongWritable',         # key class
                    'org.apache.hadoop.io.Text',                 # value class
                    conf={
                        'fs.s3n.awsAccessKeyId': '...',
                        'fs.s3n.awsSecretAccessKey': '...',
                    })
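
If you also want to avoid hard-coding the keys, here is a minimal sketch (not part of the original answer; it assumes a standard [default] profile in ~/.aws/credentials) that reads them with the standard library's configparser and passes them into the same conf dict:

import os
try:
    from configparser import ConfigParser  # Python 3
except ImportError:
    from ConfigParser import ConfigParser  # Python 2

# Read the [default] profile from the standard AWS credentials file.
parser = ConfigParser()
parser.read(os.path.expanduser('~/.aws/credentials'))
access_key = parser.get('default', 'aws_access_key_id')
secret_key = parser.get('default', 'aws_secret_access_key')

rdd = sc.hadoopFile('s3n://my_bucket/my_file',
                    'org.apache.hadoop.mapred.TextInputFormat',
                    'org.apache.hadoop.io.LongWritable',
                    'org.apache.hadoop.io.Text',
                    conf={
                        'fs.s3n.awsAccessKeyId': access_key,
                        'fs.s3n.awsSecretAccessKey': secret_key,
                    })

This keeps the keys out of the URL (and therefore out of most logs) without copying them into yet another configuration file.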