I have a S3 bucket with around 4 million files taking some 500GB in total. I need to sync the files to a new bucket (actually changing the name of the bucket would suffice, but as that is not possible I need to create a new bucket, move the files there, and remove the old one).
I'm using AWS CLI's s3 sync
command and it does the job, but takes a lot of time. I would like to reduce the time so that the dependent system downtime is minimal.
I was trying to run the sync both from my local machine and from EC2 c4.xlarge
instance and there isn't much difference in time taken.
I have noticed that the time taken can be somewhat reduced when I split the job in multiple batches using --exclude
and --include
options and run them in parallel from separate terminal windows, i.e.
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "1?/*"
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "2?/*"
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "3?/*"
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "4?/*"
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "1?/*" --exclude "2?/*" --exclude "3?/*" --exclude "4?/*"
Is there anything else I can do speed up the sync even more? Is another type of EC2
instance more suitable for the job? Is splitting the job into multiple batches a good idea and is there something like 'optimal' number of sync
processes that can run in parallel on the same bucket?
Update
I'm leaning towards the strategy of syncing the buckets before taking the system down, do the migration, and then sync the buckets again to copy only the small number of files that changed in the meantime. However running the same sync
command even on buckets with no differences takes a lot of time.
You can use EMR and S3-distcp. I had to sync 153 TB between two buckets and this took about 9 days. Also make sure the buckets are in the same region because you also get hit with data transfer costs.
aws emr add-steps --cluster-id <value> --steps Name="Command Runner",Jar="command-runner.jar",[{"Args":["s3-dist-cp","--s3Endpoint","s3.amazonaws.com","--src","s3://BUCKETNAME","--dest","s3://BUCKETNAME"]}]
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-commandrunner.html