How to mount S3 bucket on Kubernetes container/pods?

Ajeet · Aug 3, 2018

I am trying to run my Spark job on an Amazon EKS cluster. My Spark job requires some static data (reference data) on each data node/worker/executor, and this reference data is available in S3.

Can somebody kindly help me find a clean and performant solution to mount an S3 bucket on pods?

The S3 API is an option, and I am using it for my input records and output results. But the reference data is static, so I don't want to download it on every run/execution of my Spark job. On the first run the job would download the data, and subsequent jobs would check whether the data is already available locally, so there is no need to download it again.

Answer

Yiannis Gkoufas · Jun 4, 2020

We recently open-sourced a project that aims to automate these steps for you: https://github.com/IBM/dataset-lifecycle-framework

Basically you can create a dataset:

apiVersion: com.ie.ibm.hpsys/v1alpha1
kind: Dataset
metadata:
  name: example-dataset
spec:
  local:
    type: "COS"                                    # S3-compatible object storage
    accessKeyID: "iQkv3FABR0eywcEeyJAQ"            # example credentials
    secretAccessKey: "MIK3FPER+YQgb2ug26osxP/c8htr/05TVNJYuwmy"
    endpoint: "http://192.168.39.245:31772"
    bucket: "my-bucket-d4078283-dc35-4f12-a1a3-6f32571b0d62"
    region: "" # it can be empty

And then you will get a PVC that you can mount in your pods.
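
For example, here is a minimal sketch of a pod that mounts that PVC, assuming the framework creates a PersistentVolumeClaim with the same name as the Dataset (example-dataset) in the same namespace; the pod name, image, and mount path are placeholders for illustration:

apiVersion: v1
kind: Pod
metadata:
  name: spark-worker-example           # hypothetical pod name
spec:
  containers:
    - name: worker
      image: my-spark-image:latest     # placeholder image
      volumeMounts:
        - name: reference-data
          mountPath: /data/reference   # bucket contents appear under this path
  volumes:
    - name: reference-data
      persistentVolumeClaim:
        claimName: example-dataset     # PVC created for the Dataset above

With something like this, the executors can read the reference data from /data/reference instead of downloading it from S3 on every run.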