How to increase network bandwidth of AWS EC2 instance?

Srini K picture Srini K · Mar 31, 2015 · Viewed 17.2k times · Source

We hosted a site in AWS EC2 of type c4.8xlarge. It is a fairly large system with lot of memory and compute resources. Thousands of users tried to access the system during a 2 hour timeframe this weekend. While it did not crash, it slowed down quite a bit and failed to perform at the expected level. Analyzing the stats showed that limited network bandwidth is the main cause of the slowdown. The CPU usage stayed below 6%, but NetworkIn and NetworkOut seem to have peaked at 60MB and 200MB respectively during that timeframe. While I'm not an networking expect, some reading online seemed to indicate that all the traffic going through one NIC could be the main cause of limited network bandwidth. Is this true? Would hosting the site on a different type of EC2 instance help increase the network bandwidth? Here is how the networkIn and networkOut metrics looked like under heavy load.

networkIn and networkOut metrics chart

Answer

BobMcGee picture BobMcGee · Apr 2, 2016

If you were limited by bandwidth, that graph would become flat when you hit the limit. Further, as others pointed out that is only 1 MB/s out and 3 MB/s in, and I can do more than that on a t2.micro to the external internet.

What is the system doing with each request? Here are a list of things I would look at, in order:

  • Threading: are there bottlenecks in your application where only one thread can access a resource? That would keep CPU use low but cause exactly the pattern you saw.
  • Bad concurrency patterns in your application or server. Load test and look for it getting slower and slower as connections increase, while doing nothing.
  • Individual CPU: is one CPU loaded to 100% while others are mostly idle? (with 30+ cores, a saturated CPU would only give you 3% CPU use). One saturated CPU + others idle usually means a concurrency issue, probably in connection handling.
  • What is memory use like? Are you using swap at all? (that is a VERY bad sign if so, and would cause the issue). If memory use is excessive, often session storage in memory or excessively sized handler thread pools are at fault.
  • Disk I/O or external network requests: are you reading or writing with each request? vmstat will tell if you are spending a long time waiting for I/O to be serviced. If that is the case, I'd look at logging before anything.
    • The c4.8xlarge instances use EBS only, if the storage is magnetic and you write to access logs, you get a few hundred writes per second. General purpose SSDs give you 3 IO/s per GB base, but can burst to 3000 until they run out of IO credits.
    • The OS will try to combine writes, but with thousands of concurrent

It's not impossible but very unlikely that you might be bottlenecked at the network layer with connection creation or packets-per-second, if your requests are very small.