Recently I needed to implement a program that uploads files residing on Amazon EC2 to S3 in Python, as quickly as possible. Each file is about 30 KB.
I have tried several approaches: multithreading, multiprocessing, and coroutines. Below are my performance test results on Amazon EC2.
3600 (number of files) * 30 KB (file size) ≈ 105 MB (total):

- **5.5 s [4 processes + 100 coroutines]**
- 10 s [200 coroutines]
- 14 s [10 threads]
The code is shown below.
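The helpers `connect_to_s3_sevice` and `put` are not included in the snippets; roughly, they are thin wrappers around boto 2, along the lines of the sketch below (the bucket name is a placeholder and the real code may differ):

    import os
    import boto

    BUCKET_NAME = 'my-bucket'  # placeholder, not the real bucket

    def connect_to_s3_sevice():
        # Connect using credentials from the environment / IAM role and return the bucket.
        conn = boto.connect_s3()
        return conn.get_bucket(BUCKET_NAME)

    def put(client, filepath):
        # Upload one local file, keyed by its basename.
        key = client.new_key(os.path.basename(filepath))
        key.set_contents_from_filename(filepath)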
For multithreading
    import os
    import threading

    # NTHREAD, DATA_DIR, put and connect_to_s3_sevice are defined at module level.

    def mput(i, client, files):
        # Thread i uploads the files whose hash falls into its slot.
        for f in files:
            if hash(f) % NTHREAD == i:
                put(client, os.path.join(DATA_DIR, f))

    def test_multithreading():
        client = connect_to_s3_sevice()
        files = os.listdir(DATA_DIR)
        ths = [threading.Thread(target=mput, args=(i, client, files))
               for i in range(NTHREAD)]
        for th in ths:
            th.daemon = True
            th.start()
        for th in ths:
            th.join()
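The timings above were collected with a simple wall-clock harness around each test function; a minimal sketch of what that looks like (assumed, not the exact code used):

    import time

    def timed(test_fn):
        # Wall-clock timing around one test run.
        start = time.time()
        test_fn()
        print('%s took %.1f s' % (test_fn.__name__, time.time() - start))

Used as, for example, `timed(test_multithreading)`.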
For coroutines
    import functools
    import os
    import sys

    import eventlet

    client = connect_to_s3_sevice()
    pool = eventlet.GreenPool(int(sys.argv[2]))  # pool size from the command line, e.g. 200
    xput = functools.partial(put, client)
    files = os.listdir(DATA_DIR)
    for f in files:
        pool.spawn_n(xput, os.path.join(DATA_DIR, f))
    pool.waitall()
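One detail worth checking: eventlet's green threads only overlap network I/O if the blocking socket calls inside `put` go through eventlet's green sockets, which usually means monkey-patching the standard library before boto is imported or used, e.g.:

    import eventlet
    # Patch socket, ssl, time, etc. so boto's blocking calls yield to other green threads.
    eventlet.monkey_patch()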
For multiprocessing + coroutines
    import functools
    import multiprocessing
    import os

    import eventlet

    def pproc(i):
        # Each process opens its own S3 connection and runs its own green pool.
        client = connect_to_s3_sevice()
        files = os.listdir(DATA_DIR)
        pool = eventlet.GreenPool(100)
        xput = functools.partial(put, client)
        for f in files:
            if hash(f) % NPROCESS == i:
                pool.spawn_n(xput, os.path.join(DATA_DIR, f))
        pool.waitall()

    def test_multiproc():
        procs = [multiprocessing.Process(target=pproc, args=(i,))
                 for i in range(NPROCESS)]
        for p in procs:
            p.daemon = True
            p.start()
        for p in procs:
            p.join()
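For reference, the constants behind the timings would be set roughly like this (the data directory is a placeholder; the thread/process counts are the ones from the measurements above):

    DATA_DIR = '/data/files'   # placeholder path
    NTHREAD = 10               # the 14 s run
    NPROCESS = 4               # the 5.5 s run (4 processes x 100 coroutines each)

    if __name__ == '__main__':
        test_multiproc()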
The machine configuration is Ubuntu 14.04, 2 CPUs (2.50 GHz), and 4 GB of memory.
The highest speed reached is about 19 MB/s (105 / 5.5). Overall, it is still too slow. Is there any way to speed it up? Could Stackless Python do it faster?
Sample parallel upload times to Amazon S3 using the Python boto SDK are available here:
Rather than writing the code yourself, you might also consider calling out to the AWS Command Line Interface (CLI), which can do uploads in parallel. It is also written in Python and uses boto.
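For example, something along these lines pushes a whole directory using the CLI's built-in parallel transfers, invoked from Python (the local path, bucket name, and concurrency value are placeholders/examples):

    import subprocess

    # Raise the CLI's S3 concurrency (persisted in ~/.aws/config); 20 is just an example value.
    subprocess.check_call(['aws', 'configure', 'set',
                           'default.s3.max_concurrent_requests', '20'])

    # Recursively upload the directory; the CLI parallelizes the individual transfers.
    subprocess.check_call(['aws', 's3', 'sync', '/data/files', 's3://my-bucket/prefix/'])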