I'm working on a project that incorporates file storage and sharing features, and after months of researching the best way to leverage AWS I'm still a little concerned.
Basically, my decision is between using EBS storage or S3 to house user files. The system will incorporate on-the-fly zip archiving when the user wants to download a handful of files. Also, when users download any files I don't want the URLs to the files exposed.
The two best options I've come up with are:
Have an EC2 instance which has a number of EBS volumes mounted to store user files.
After files are uploaded and processed, the system pushes those files to an S3 bucket for long-term storage. When files are requested, I will retrieve them from S3 and stream them back to the client.
Are any of my assumptions flawed? Can anyone think of a better way of managing massive amounts of file storage?
If your service is going to be used by an undetermined number of users, bear in mind that scalability will be a concern regardless of which option you adopt: you will need to scale the service to meet demand. It is therefore safer to assume that your service will run in an Auto Scaling group with a pool of EC2 instances rather than on a single instance.
Regarding protecting the URLs so that only authorized users can download the files, there are many ways to do this without requiring your service to act as an intermediary, but you will need to deal with at least two issues:
File name predictability: to avoid predictable URLs, you could store each uploaded file under a hashed key and keep the original filename and ownership in a database such as SimpleDB. Optionally, you can set an HTTP header such as "Content-Disposition: filename=original_file_name.ext" so that the user's browser names the downloaded file accordingly (see the sketch after these two points).
Authorization: when the user asks your service to download a given file, issue a temporary authorization using Query String Authentication or Temporary Security Credentials for that specific user, granting read access to the file for a limited period of time, and then redirect to the S3 bucket URL for direct download. This can greatly offload your pool of EC2 instances, making them available to process other requests more quickly.
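As a minimal sketch of both points, assuming the AWS SDK for Java (v1) and placeholder names for the bucket and class, the upload could store each file under an unpredictable key with a Content-Disposition header, and downloads could go through a short-lived presigned (Query String Authentication) URL:

```java
import java.io.File;
import java.util.Date;
import java.util.UUID;

import com.amazonaws.HttpMethod;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;

public class FileStore {

    private static final String BUCKET = "my-user-files";   // placeholder bucket name
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    /** Uploads a file under an unpredictable key (a content hash would also work);
     *  the original name only appears in Content-Disposition and in your database. */
    public String upload(File file, String originalName) {
        String key = UUID.randomUUID().toString();
        ObjectMetadata meta = new ObjectMetadata();
        meta.setContentDisposition("attachment; filename=\"" + originalName + "\"");
        s3.putObject(new PutObjectRequest(BUCKET, key, file).withMetadata(meta));
        return key;
    }

    /** Issues a presigned URL valid for 10 minutes; your service can simply
     *  redirect the authorized user to it for direct download from S3. */
    public String temporaryUrl(String key) {
        Date expiration = new Date(System.currentTimeMillis() + 10 * 60 * 1000L);
        GeneratePresignedUrlRequest request =
                new GeneratePresignedUrlRequest(BUCKET, key)
                        .withMethod(HttpMethod.GET)
                        .withExpiration(expiration);
        return s3.generatePresignedUrl(request).toString();
    }
}
```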
To reduce the space used and the traffic to your S3 bucket (remember you pay per GB stored and transferred), I would also recommend compressing each individual file with a standard algorithm like gzip before uploading it to S3 and setting the header "Content-Encoding: gzip" so that the user's browser decompresses it automatically; a sketch of this follows. If your programming language of choice is Java, I suggest taking a look at the webcache-s3-maven-plugin that I created to upload static resources from web projects.
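A rough sketch of that idea, under the same SDK assumption and with placeholder names, gzips the file before upload and sets Content-Encoding so browsers decompress it transparently:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPOutputStream;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectMetadata;

public class GzipUploader {

    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    /** Gzips the file in memory and uploads it with Content-Encoding: gzip,
     *  so the user's browser uncompresses it automatically on download. */
    public void uploadCompressed(String bucket, String key, Path file, String contentType)
            throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(Files.readAllBytes(file));
        }
        byte[] compressed = buffer.toByteArray();

        ObjectMetadata meta = new ObjectMetadata();
        meta.setContentType(contentType);
        meta.setContentEncoding("gzip");              // triggers transparent decompression
        meta.setContentLength(compressed.length);     // length of the compressed payload

        s3.putObject(bucket, key, new ByteArrayInputStream(compressed), meta);
    }
}
```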
Regarding the time needed to compress a folder, you often cannot guarantee that it will be compressed quickly enough for the user to download it immediately, since occasionally there will be huge folders that take minutes or even hours to compress. For this I suggest using the SQS and SNS services to allow asynchronous compression processing; roughly, it would work as follows:
1. The frontend receives the download request and puts a compression-request message on an SQS queue, letting the user know the archive is being prepared.
2. A backend instance polls the queue, compresses the requested files into an archive, and uploads the archive to S3.
3. When the archive is ready, the backend publishes an SNS notification (an email or a callback to the frontend, for example) so the user can be given the temporary download URL.
In this scenario you could have two Auto Scaling groups, frontend and backend respectively, which may have different scalability constraints.
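As a sketch of that asynchronous hand-off, again assuming the AWS SDK for Java (v1) and with hypothetical queue URL, topic ARN, and class names, the frontend could enqueue the job on SQS and the backend could publish to SNS when the archive is ready:

```java
import java.util.List;

import com.amazonaws.services.sns.AmazonSNS;
import com.amazonaws.services.sns.AmazonSNSClientBuilder;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;

public class CompressionQueue {

    // placeholder identifiers
    private static final String QUEUE_URL =
            "https://sqs.us-east-1.amazonaws.com/123456789012/compression-requests";
    private static final String TOPIC_ARN =
            "arn:aws:sns:us-east-1:123456789012:archive-ready";

    private final AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
    private final AmazonSNS sns = AmazonSNSClientBuilder.defaultClient();

    /** Frontend: enqueue a compression request instead of blocking the web request. */
    public void requestCompression(String userId, String folderId) {
        sqs.sendMessage(QUEUE_URL, userId + ":" + folderId);
    }

    /** Backend: poll the queue, compress and upload (omitted), then notify via SNS. */
    public void processNext() {
        List<Message> messages = sqs.receiveMessage(QUEUE_URL).getMessages();
        for (Message message : messages) {
            // ... compress the requested folder and upload the archive to S3 ...
            sns.publish(TOPIC_ARN, "Archive ready for request " + message.getBody());
            sqs.deleteMessage(QUEUE_URL, message.getReceiptHandle());
        }
    }
}
```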