Fastest way to sync two Amazon S3 buckets

Question 1

Fastest way to sync two Amazon S3 buckets

amazon-web-services amazon-s3 amazon-ec2 aws-cli

mrt · Aug 25, 2016 · Viewed 24.2k times · Source

Answer

Answer

You can use EMR and S3-distcp. I had to sync 153 TB between two buckets and this took about 9 days. Also make sure the buckets are in the same region because you also get hit with data transfer costs.

aws emr add-steps --cluster-id <value> --steps Name="Command Runner",Jar="command-runner.jar",[{"Args":["s3-dist-cp","--s3Endpoint","s3.amazonaws.com","--src","s3://BUCKETNAME","--dest","s3://BUCKETNAME"]}]

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html

http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-commandrunner.html

Question 2

I have a S3 bucket with around 4 million files taking some 500GB in total. I need to sync the files to a new bucket (actually changing the name of the bucket would suffice, but as that is not possible I need to create a new bucket, move the files there, and remove the old one).

I'm using AWS CLI's s3 sync command and it does the job, but takes a lot of time. I would like to reduce the time so that the dependent system downtime is minimal.

I was trying to run the sync both from my local machine and from EC2 c4.xlarge instance and there isn't much difference in time taken.

I have noticed that the time taken can be somewhat reduced when I split the job in multiple batches using --exclude and --include options and run them in parallel from separate terminal windows, i.e.

aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "1?/*" 
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "2?/*" 
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "3?/*" 
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "4?/*" 
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "1?/*" --exclude "2?/*" --exclude "3?/*" --exclude "4?/*"

Is there anything else I can do speed up the sync even more? Is another type of EC2 instance more suitable for the job? Is splitting the job into multiple batches a good idea and is there something like 'optimal' number of sync processes that can run in parallel on the same bucket?

Update

I'm leaning towards the strategy of syncing the buckets before taking the system down, do the migration, and then sync the buckets again to copy only the small number of files that changed in the meantime. However running the same sync command even on buckets with no differences takes a lot of time.

Fastest way to sync two Amazon S3 buckets

Answer

Related questions