Need help deciding between EBS and S3 on Amazon Web Services

andrewvnice · Aug 11, 2012 · Viewed 16.6k times

I'm working on a project that incorporates file storage and sharing features, and after months of researching the best method to leverage AWS I'm still a little concerned.

Basically my decision is between using EBS storage to house user files or S3. The system will incorporate on-the-fly zip archiving when the user wants to download a handful of files. Also, when users download any files I don't want the URL to the files exposed.

The two best options I've come up with are:

  1. Have an EC2 instance which has a number of EBS volumes mounted to store user files.

    • pros: It seems much faster than S3, and zipping files from the EBS volume is straightforward.
    • cons: I believe Amazon caps how much total EBS storage you can use, and it is not as redundant as S3.
  2. After files are uploaded and processed, the system pushes those files to an S3 bucket for long term storage. When files are requested I will retrieve the files from S3 and output back to the client.

    • pros: Redundancy, no file storage limits
    • cons: It seems very SLOW, there is no way to mount an S3 bucket as a volume in the filesystem, and serving zipped files would mean transferring each file to the EC2 instance, zipping, and then finally sending the output (again, slow!)

Are any of my assumptions flawed? Can anyone think of a better way of managing massive amounts of file storage?

Answer

Alessandro Oliveira · Aug 11, 2012

If your service is going to be used by an undetermined number of users, it is important to bear in mind that scalability will always be a concern. Regardless of the option adopted, you will need to scale the service to meet demand, so it would be convenient to assume that your service will be running in an Auto Scaling group with a pool of EC2 instances, not on a single instance.

Regarding protecting the URLs so that only authorized users can download the files, there are many ways to do this without requiring your service to act as an intermediary, but you will need to deal with at least two issues:

  1. File name predictability: to avoid URL predictability, you could name each uploaded file with a hash and store the original filename and ownership in a database such as SimpleDB. Optionally, you can set an HTTP header such as "Content-Disposition: filename=original_file_name.ext" to advise the user's browser to name the downloaded file accordingly (first sketch after this list).

  2. Authorization: when the user asks to download a given file, your service issues a temporary authorization using Query String Authentication or Temporary Security Credentials for that specific user, granting read access to the file for a period of time, and then redirects to the S3 bucket URL for direct download (second sketch after this list). This can greatly offload your pool of EC2 instances, making them available to process other requests more quickly.
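
To make the first point concrete, here is a minimal upload-side sketch in Python with boto3; the bucket name, key scheme, and helper function are illustrative assumptions, not something prescribed by this answer:

```python
import hashlib
import boto3

s3 = boto3.client("s3")
BUCKET = "user-files-example"  # hypothetical bucket name


def upload_user_file(path, original_name):
    """Store the file under an unpredictable hashed key, keeping the
    original name only in the Content-Disposition header."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    key = "files/" + digest
    s3.upload_file(
        path,
        BUCKET,
        key,
        ExtraArgs={
            # Advise the browser to save the file under its original name.
            "ContentDisposition": 'attachment; filename="%s"' % original_name
        },
    )
    # The key -> (original filename, owner) mapping would be stored in a
    # database such as SimpleDB, as described above.
    return key
```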
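For the authorization step, Query String Authentication is what boto3 exposes as a pre-signed URL. A sketch under the same assumptions:

```python
import boto3

s3 = boto3.client("s3")


def temporary_download_url(key, original_name, expires_in=300):
    """Issue a time-limited URL for one object; the service then redirects
    the authorized user to it for direct download from S3."""
    return s3.generate_presigned_url(
        "get_object",
        Params={
            "Bucket": "user-files-example",  # hypothetical bucket name
            "Key": key,
            # The download filename can also be overridden per request.
            "ResponseContentDisposition": 'attachment; filename="%s"' % original_name,
        },
        ExpiresIn=expires_in,  # seconds the URL remains valid
    )
```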

To reduce the space and traffic to your S3 bucket (remember you pay per GB stored and transferred), I would also recommend compressing each individual file using a standard algorithm like gzip before uploading to S3, and setting the header "Content-Encoding: gzip" so that automatic decompression works in the user's browser. If your programming language of choice is Java, I suggest taking a look at the webcache-s3-maven-plugin that I created to upload static resources from web projects.
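
A minimal sketch of that pattern with boto3 (bucket and key names are placeholders):

```python
import gzip
import shutil
import boto3

s3 = boto3.client("s3")


def upload_gzipped(path, bucket, key):
    """Compress a file with gzip before upload and mark it with
    Content-Encoding: gzip so browsers decompress it transparently."""
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    s3.upload_file(gz_path, bucket, key,
                   ExtraArgs={"ContentEncoding": "gzip"})
```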

Regarding the time needed to compress a folder, you will frequently be unable to guarantee that folders are compressed quickly enough for the user to download them immediately, since some folders could be huge and take minutes or even hours to compress. For this I suggest you use the SQS and SNS services to process compression asynchronously. It would work as follows:

  1. user requests folder compression
  2. the frontend EC2 instance creates a compression request in an SQS queue (first sketch below)
  3. a backend EC2 instance consumes the compression request from the SQS queue
  4. the backend instance downloads the files from S3 to an EBS volume; since the generated files are temporary, I would suggest instead using at least m1.small instances with ephemeral disks, which are local to the virtual machine, to reduce I/O latency and processing time
  5. after the compressed file is generated, the service uploads it to the S3 bucket, optionally setting Object Expiration properties that tell S3 to delete the file automatically after a certain period of time (again, to reduce your storage costs), and publishes a notification to an SNS topic that the file is ready to be downloaded
  6. if the user is still online, read the notification from the topic and notify the user that the zip file is ready to be downloaded; if the notification does not arrive after a while, tell the user that compression is taking longer than expected and that the service will notify them by e-mail as soon as the file is ready
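
To make the flow concrete, here are minimal sketches in Python with boto3; the queue URL, topic ARN, bucket name, paths, and helper names are all illustrative assumptions. First, step 2 on the frontend:

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/zip-requests"  # hypothetical


def request_compression(user_id, folder_prefix):
    """Step 2: enqueue a compression request and return immediately
    instead of zipping inline."""
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"user": user_id, "prefix": folder_prefix}),
    )
```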
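A sketch of the backend worker covering steps 3 to 5, under the same assumptions (a real worker would add error handling and unique archive names):

```python
import json
import os
import zipfile
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")
sns = boto3.client("sns")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/zip-requests"  # hypothetical
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:zip-ready"  # hypothetical
BUCKET = "user-files-example"  # hypothetical
SCRATCH = "/mnt/scratch"  # assumed mount point of the ephemeral disk


def work_once():
    # Step 3: long-poll the queue for a compression request.
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        req = json.loads(msg["Body"])
        zip_path = os.path.join(SCRATCH, req["user"] + ".zip")
        # Step 4: download each file to local scratch space and zip it.
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
            pages = s3.get_paginator("list_objects_v2").paginate(
                Bucket=BUCKET, Prefix=req["prefix"])
            for page in pages:
                for obj in page.get("Contents", []):
                    local = os.path.join(SCRATCH, os.path.basename(obj["Key"]))
                    s3.download_file(BUCKET, obj["Key"], local)
                    zf.write(local, arcname=os.path.basename(obj["Key"]))
        # Step 5: upload the archive under a prefix covered by the
        # expiration rule below, then announce it on the SNS topic.
        zip_key = "zips/" + req["user"] + ".zip"
        s3.upload_file(zip_path, BUCKET, zip_key)
        sns.publish(TopicArn=TOPIC_ARN,
                    Message=json.dumps({"user": req["user"], "key": zip_key}))
        sqs.delete_message(QueueUrl=QUEUE_URL,
                           ReceiptHandle=msg["ReceiptHandle"])
```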
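The Object Expiration mentioned in step 5 is configured once per bucket as a lifecycle rule; a sketch, assuming the archives live under a zips/ prefix:

```python
import boto3

s3 = boto3.client("s3")

# One-time setup: automatically delete generated archives after a day.
s3.put_bucket_lifecycle_configuration(
    Bucket="user-files-example",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [{
            "ID": "expire-temporary-zips",
            "Status": "Enabled",
            "Filter": {"Prefix": "zips/"},
            "Expiration": {"Days": 1},
        }]
    },
)
```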

In this scenario you could have two Auto Scaling groups, frontend and backend respectively, each with its own scalability constraints (see the sketch below).
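
As a rough sketch of that split (group and launch configuration names are made up for illustration):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# One group for web frontends, one for zip workers; each can then be
# scaled on its own signal (e.g. request load vs. SQS queue depth).
for name, launch_config, max_size in [("frontend-asg", "frontend-lc", 10),
                                      ("backend-zip-asg", "backend-lc", 4)]:
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName=name,
        LaunchConfigurationName=launch_config,
        MinSize=1,
        MaxSize=max_size,
        AvailabilityZones=["us-east-1a", "us-east-1b"],
    )
```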