Is there a way to concurrently download S3 files using boto3 in Python3? I am aware of the aiobotocore library, but I would like to know if there is a way to do it using the standard boto3 library.
If you want to download lots of smaller files directly to disk in parallel using boto3, you can do so with the multiprocessing module. Here's a little snippet that will do just that. You run it like:

./download.py bucket_name s3_key_0 s3_key_1 ... s3_key_n
#!/usr/bin/env python3
import multiprocessing
import sys

import boto3

# one s3 client per worker process, created by the pool initializer
s3_client = None

def initialize():
    global s3_client
    s3_client = boto3.client('s3')

# the work function of each process, which fetches one object from s3
def download(job):
    bucket, key, filename = job
    s3_client.download_file(bucket, key, filename)

if __name__ == '__main__':
    # build the jobs; arguments to the program are: bucket s3_key_0 s3_key_1 ... s3_key_n
    bucket = sys.argv[1]
    jobs = [(bucket, key, key.replace('/', '_')) for key in sys.argv[2:]]

    # make a process pool to do the work
    pool = multiprocessing.Pool(multiprocessing.cpu_count(), initialize)
    pool.map(download, jobs)
    pool.close()
    pool.join()
One important piece of this is that we create an S3 client instance for every process, which that process then reuses. This matters for two reasons. First, creating a client is slow, so we want to do it as infrequently as possible. Second, clients should not be shared across processes, because calls to download_file may mutate the client's internal state.
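If you would rather not spawn processes, the same per-worker-client pattern can also be expressed with threads, which are usually enough here because the downloads are I/O-bound. The sketch below is only an illustration of that variant, not part of the snippet above; the worker count of 16 and the one-client-per-thread approach via threading.local are assumptions you can adjust.

#!/usr/bin/env python3
import concurrent.futures
import sys
import threading

import boto3

# lazily create one s3 client per thread instead of per process
thread_local = threading.local()

def get_client():
    if not hasattr(thread_local, 's3_client'):
        thread_local.s3_client = boto3.client('s3')
    return thread_local.s3_client

def download(job):
    bucket, key, filename = job
    get_client().download_file(bucket, key, filename)

if __name__ == '__main__':
    bucket = sys.argv[1]
    jobs = [(bucket, key, key.replace('/', '_')) for key in sys.argv[2:]]

    # 16 workers is an arbitrary choice; tune it for your bandwidth and object sizes
    with concurrent.futures.ThreadPoolExecutor(max_workers=16) as executor:
        # list() forces the map to run and re-raises any exceptions from the workers
        list(executor.map(download, jobs))

The executor.map call plays the same role as pool.map in the multiprocessing version; the main trade-off is process startup cost versus keeping everything in one process.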