How to upload small files to Amazon S3 efficiently in Python

Jacky1205 · Dec 15, 2014 · Viewed 10.5k times

Recently, I needed to implement a program in Python that uploads files residing on an Amazon EC2 instance to S3 as quickly as possible. Each file is about 30 KB.

I have tried several approaches: multithreading, multiprocessing, and coroutines. The following are my performance test results on Amazon EC2.

3600 (number of files) * 30 KB (file size) ≈ 105 MB (total) --->

       **5.5s [ 4 processes + 100 coroutines ]**
       10s  [ 200 coroutines ]
       14s  [ 10 threads ]

The code is shown below.

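The put and connect_to_s3_sevice helpers and the DATA_DIR, NTHREAD and NPROCESS constants are omitted from the snippets; a minimal sketch of what they might look like with the boto SDK, using a placeholder bucket name and data directory:

import os

import boto
from boto.s3.key import Key

DATA_DIR = '/data/files'   # directory holding the ~30 KB files (placeholder)
NTHREAD = 10               # number of uploader threads
NPROCESS = 4               # number of uploader processes


def connect_to_s3_sevice():
    # credentials are picked up from the environment / ~/.boto
    return boto.connect_s3()


def put(client, filepath):
    # upload one local file, keyed by its file name
    bucket = client.get_bucket('my-test-bucket', validate=False)
    key = Key(bucket, os.path.basename(filepath))
    key.set_contents_from_filename(filepath)
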
For multithreading

def mput(i, client, files):
    # thread i uploads only the files whose hash maps to its index
    for f in files:
        if hash(f) % NTHREAD == i:
            put(client, os.path.join(DATA_DIR, f))


def test_multithreading():
    # one shared connection for all threads
    client = connect_to_s3_sevice()
    files = os.listdir(DATA_DIR)
    ths = [threading.Thread(target=mput, args=(i, client, files)) for i in range(NTHREAD)]
    for th in ths:
        th.daemon = True
        th.start()
    for th in ths:
        th.join()

For coroutine

client = connect_to_s3_sevice()
pool = eventlet.GreenPool(int(sys.argv[2]))  # pool size taken from the command line

xput = functools.partial(put, client)
files = os.listdir(DATA_DIR)
for f in files:
    pool.spawn_n(xput, os.path.join(DATA_DIR, f))  # one green thread per file
pool.waitall()
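
One detail not shown above: for the green threads to actually overlap the network I/O, eventlet has to monkey-patch the standard library first, otherwise the S3 client's blocking socket calls serialize the uploads. Assuming nothing else in the program conflicts with patching, it is done once at startup:

import eventlet
eventlet.monkey_patch()  # make socket/ssl cooperative so uploads can run concurrently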

For multiprocessing + coroutine

def pproc(i):
    # each process opens its own connection and runs its own green-thread pool
    client = connect_to_s3_sevice()
    files = os.listdir(DATA_DIR)
    pool = eventlet.GreenPool(100)

    xput = functools.partial(put, client)
    for f in files:
        if hash(f) % NPROCESS == i:
            pool.spawn_n(xput, os.path.join(DATA_DIR, f))
    pool.waitall()


def test_multiproc():
    procs = [multiprocessing.Process(target=pproc, args=(i, )) for i in range(NPROCESS)]
    for p in procs:
        p.daemon = True
        p.start()
    for p in procs:
        p.join()
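
Note that every process lists the whole directory and then filters by hash. A simpler way to hand each worker its share of the files is to slice the sorted listing, for example:

# deterministic round-robin split: process i takes every NPROCESS-th file
for f in sorted(files)[i::NPROCESS]:
    pool.spawn_n(xput, os.path.join(DATA_DIR, f))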

The machine configuration is Ubuntu 14.04, 2 CPUs (2.50 GHz), 4 GB memory.

The highest speed reached is about 19 MB/s (105 MB / 5.5 s). Overall, that is still too slow. Is there any way to speed it up? Could Stackless Python do it faster?

Answer

John Rotenstein · Dec 15, 2014

Sample parallel upload times to Amazon S3 using the Python boto SDK are available here:

Rather than writing the code yourself, you might also consider calling out to the AWS Command Line Interface (CLI), which can do uploads in parallel. It is also written in Python and uses boto.
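
A minimal way to call the CLI from the Python program, assuming it is installed and configured, and using placeholder paths and bucket name:

import subprocess

# hand the whole directory to the AWS CLI, which uploads the files in parallel
subprocess.check_call([
    'aws', 's3', 'cp', '/path/to/DATA_DIR',
    's3://my-test-bucket/', '--recursive',
])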