Amazon S3 architecture

amazon-s3 couchdb amazon hadoop distributed-system

Sukumar · Feb 19, 2009 · Viewed 22.2k times · Source

The post at http://highscalability.com/amazon-architecture explains Amazon's architecture in general, but I am interested in how Amazon S3 specifically is implemented.

Some of my guesses are:

A distributed file system like HDFS http://hadoop.apache.org/core/docs/current/hdfs_design.html
A non relational persistent DB like CouchDB http://couchdb.apache.org/

Would it be possible to implement something similar on a much smaller scale using a scripting language like Python or PHP?

Answer

Amazon S3 is implemented using the architecture described in the Dynamo paper:

http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

The paper explains consistent hashing, and how and why the consistency guarantee it offers is only "eventual consistency".
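Since the question asks about trying this on a small scale in Python, here is a minimal sketch of consistent hashing with virtual nodes. It is only an illustration of the idea, not Amazon's code: the names `HashRing`, `ring_position`, and `vnodes`, and the default of 64 virtual points per node, are all my own assumptions, and MD5 is used here simply as a convenient way to spread keys around the ring.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a point on the ring (hypothetical helper)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring; names and defaults are illustrative only."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes      # virtual points per physical node
        self._positions = []      # sorted ring positions
        self._owner = {}          # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each node is placed on the ring at several pseudo-random points.
        for i in range(self.vnodes):
            pos = ring_position(f"{node}#{i}")
            self._owner[pos] = node
            bisect.insort(self._positions, pos)

    def remove_node(self, node: str) -> None:
        # Drop only this node's points; every other point stays where it was.
        self._positions = [p for p in self._positions if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def get_node(self, key: str) -> str:
        if not self._positions:
            raise ValueError("ring has no nodes")
        # First ring position clockwise from the key, wrapping at the end.
        idx = bisect.bisect_right(self._positions, ring_position(key))
        return self._owner[self._positions[idx % len(self._positions)]]
```

The point of the ring is that removing a node only moves the keys that node owned; keys stored on the other nodes keep their placement, which is what makes adding and removing machines cheap.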

The conflict resolution the paper describes for Dynamo is not exposed to users of S3. Amazon's internal applications use it, but for S3 the only conflict resolution is last write wins.
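To make "last write wins" concrete, here is a rough Python sketch of how a reader could pick a winner among divergent replicas. S3's internals are not public, so everything here is an assumption for illustration: the `Version` record, the `resolve_last_write_wins` function, and in particular the tie-break on `replica_id` when two timestamps are equal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    """One replica's copy of an object (hypothetical record layout)."""
    value: bytes
    timestamp: float   # time the write was accepted
    replica_id: str    # used only to break timestamp ties (an assumption)


def resolve_last_write_wins(versions):
    """Return the version with the newest timestamp; older writes are discarded."""
    if not versions:
        raise ValueError("no versions to resolve")
    return max(versions, key=lambda v: (v.timestamp, v.replica_id))
```

Note that the losing writes are silently dropped, which is why the user never sees a conflict.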

Edit: Werner Vogels has said "Dynamo is not directly exposed externally as a web service; however, Dynamo and similar Amazon technologies are used to power parts of our Amazon Web Services, such as S3." http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

I would emphasize that he isn't merely saying S3 and Dynamo share components; he explicitly says that Dynamo itself is one of the technologies that power S3. Everything I've seen from S3, including the caveats, is accounted for by assuming S3 is a fancy web services wrapper around Dynamo, with authentication, accounting, and a last-write-wins conflict resolution that is invisible to the user.

The original question was about the underlying storage mechanism for S3. It is explicitly not a distributed file system like HDFS or a non-relational database like CouchDB. Dynamo fills this role.
