I've found that Instagram shares its technology implementation with other developers through their blog. They have some great solutions for the problems they run into. One of those solutions is an Elastic Load Balancer on Amazon with 3 nginx instances behind it. What is the task of those nginx servers? What is the task of the Elastic Load Balancer, and what is the relation between them?
Disclaimer: I am no expert on this in any way and am in the process of learning about the AWS ecosystem myself.
The ELB (Elastic Load Balancer) has no functionality of its own except receiving requests and routing them to the right server. The servers can run nginx, IIS, Apache, lighttpd, you name it.
I will give you a real use case.
I had one nginx server running one WordPress blog. This server was, like I said, powered by nginx, serving static content and "upstreaming" .php requests to PHP-FPM running on the same server. Everything was going fine until one day the blog was featured on a TV show. I had a ton of users and the server could not keep up with that much traffic. My first reaction was to just use the AMI (Amazon Machine Image) to spin up a copy of my server on a more powerful instance type such as m1.large. The problem was that I knew traffic would keep increasing over the next couple of days. Soon I would have to spin up an even more powerful machine, and that would mean more downtime and trouble. Instead, I launched an ELB and updated my DNS to point website traffic to the ELB instead of directly to the server. The user doesn't know the server's IP or anything else; he only sees the ELB, and everything else goes on inside Amazon's cloud. The ELB decides which server the traffic goes to. You can have the ELB and only one server at a time (if your traffic is low at the moment), or hundreds. Servers can be created and added to the server array (server group) at any time, or you can configure auto scaling with the Amazon command line tools to spawn new servers and add them to the ELB server group, all automatically.
Amazon CloudWatch (another product and an important part of the AWS ecosystem) is always collecting metrics, such as CPU usage, from your servers. The routing itself is done by the ELB, which runs its own health checks and only sends users to servers that pass them. CloudWatch is what notices when all the servers are becoming too loaded: its alarms trigger the auto scaling policy that spawns another server (using your AMI). When the servers are not under heavy load anymore, the extra ones are automatically terminated.
This way I was able to serve all users at all times, and when the load was light, I would have the ELB and only one nginx server. When the load was high I would let auto scaling decide how many servers I needed (according to server load). Minimal downtime. Of course you can set limits on how many servers you can run at the same time, so you don't get billed for more than you can pay.
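The scaling behaviour described above (add a server when the group is loaded, remove one when the load is light, never go past a cap you can afford) can be sketched in a few lines of Python. This is only an illustration of the idea, not how AWS implements it: the function name, the CPU thresholds, and the default limits are all assumptions made up for the example.

```python
def desired_server_count(current, avg_cpu, min_servers=1, max_servers=5,
                         scale_out_at=70.0, scale_in_at=30.0):
    """Return how many servers the group should have after one evaluation.

    All thresholds and limits are hypothetical values for illustration.
    """
    if avg_cpu > scale_out_at:
        target = current + 1      # group is overloaded: add one server
    elif avg_cpu < scale_in_at:
        target = current - 1      # group is mostly idle: remove one server
    else:
        target = current          # load is in the comfortable range
    # Never drop below the minimum or exceed the billing cap.
    return max(min_servers, min(max_servers, target))


print(desired_server_count(1, 90.0))  # traffic spike on a single server
print(desired_server_count(5, 95.0))  # already at the cap
print(desired_server_count(3, 10.0))  # quiet period
```

The `max_servers` argument plays the role of the billing limit: however high the load gets, the group never grows past it.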
You see, the Instagram guys said the following: "we used to run 2 nginx machines and DNS Round-Robin between them". This is inefficient IMO compared to ELB. With DNS round robin, the DNS name resolves to several server addresses that are handed out in rotation, so the first client goes to server one, the second goes to server two, and so on, with no regard for whether a server is actually up. ELB actually runs health checks against each server (for example, a periodic HTTP request that has to succeed) and sends traffic only to the servers that pass. Do you see the difference? And they say: "The downside of this approach is the time it takes for DNS to update in case one of the machines needs to get decommissioned." DNS round robin is a form of load balancing. But if one server goes kaput and you need to update DNS to remove it from the server group, you will have downtime, because cached DNS records take time to expire around the world. Until then, some users will get routed to the bad server. With ELB this is automatic: if a server is in bad health it does not receive any more traffic, unless of course the whole group of servers is in bad health and you do not have any kind of auto scaling set up.
And now the guys at Instagram: "Recently, we moved to using Amazon’s Elastic Load Balancer, with 3 NGINX instances behind it that can be swapped in and out (and are automatically taken out of rotation if they fail a health check)."
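To make the difference concrete, here is a small Python simulation of the two approaches. It is a toy model, not the real DNS or ELB behaviour: the server names, the `healthy` table, and both function names are invented for the example, and the health-aware balancer simply rotates over whichever servers are marked healthy.

```python
from itertools import cycle


def dns_round_robin(servers, n_requests):
    """Hand out servers in fixed rotation, with no knowledge of their health."""
    rotation = cycle(servers)
    return [next(rotation) for _ in range(n_requests)]


def health_checked(servers, healthy, n_requests):
    """Rotate only over servers that currently pass the health check."""
    in_service = [s for s in servers if healthy[s]]
    if not in_service:
        raise RuntimeError("no healthy servers in the group")
    rotation = cycle(in_service)
    return [next(rotation) for _ in range(n_requests)]


# Hypothetical group of three servers where nginx-2 has just died.
servers = ["nginx-1", "nginx-2", "nginx-3"]
healthy = {"nginx-1": True, "nginx-2": False, "nginx-3": True}

rr = dns_round_robin(servers, 6)
lb = health_checked(servers, healthy, 6)
print(rr.count("nginx-2"))  # 2 requests still land on the dead server
print(lb.count("nginx-2"))  # 0 requests reach it
```

With plain rotation a third of the requests keep hitting the dead machine until someone fixes DNS; with the health check in front, it silently drops out of rotation.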
The scenario I illustrated is fictional. Real setups are more complex than that, but nothing that cannot be solved. For instance, if users upload pictures to your application, how can you keep all the machines in the server group consistent? You would need to store the images on an external service like Amazon S3. From another post on Instagram engineering: “The photos themselves go straight to Amazon S3, which currently stores several terabytes of photo data for us.” If they have 3 nginx servers behind the load balancer and all servers serve HTML pages in which the image links point to S3, there is no problem. If the images are stored locally on each instance, there is no way to do it, because a picture uploaded to one server is missing on the others. All servers behind the ELB would also need an external database. For that Amazon has RDS: all machines can point to the same database, so they all see the same data. On the image above, you can see an RDS "Read replica", which is a copy of the database that serves read queries and is RDS's way of spreading read load. I don't know much about that at this time, sorry.
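The consistency problem with locally stored uploads can be shown with a tiny Python model. Nothing here talks to AWS: the `Server` class is hypothetical, and a plain dictionary stands in for either a server's local disk or a shared store such as an S3 bucket.

```python
class Server:
    """Toy web server that keeps uploads in local or shared storage."""

    def __init__(self, name, shared_store=None):
        self.name = name
        # Without a shared store, each server gets its own private "disk".
        self.store = {} if shared_store is None else shared_store

    def upload(self, key, data):
        self.store[key] = data

    def fetch(self, key):
        return self.store.get(key)


# Local storage: the upload lands on server a, the next request hits b.
a, b = Server("a"), Server("b")
a.upload("cat.jpg", b"...")
print(b.fetch("cat.jpg"))  # None: b has never seen the file

# Shared storage: both servers read and write the same store.
bucket = {}
c, d = Server("c", bucket), Server("d", bucket)
c.upload("cat.jpg", b"...")
print(d.fetch("cat.jpg"))  # b'...': d finds the file
```

Since the load balancer may send each request to a different server, only the shared-store variant gives every user the same view of the uploads.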
Try and read this: http://awsadvent.tumblr.com/post/38043683444/using-elb-and-auto-scaling
Best regards.