How do I work out why an ECS health-check is failing?

amazon-web-services amazon-ec2 amazon-elb amazon-vpc amazon-ecs

Neil Trodden · Mar 13, 2017 · Viewed 8.3k times

Outline:

I have a very simple ECS container which listens on port 5000 and writes out HelloWorld, plus the hostname of the instance it is running on. I want to deploy many of these containers using ECS and load balance them, just to learn more about how this works. It is working to a certain extent, but my health check is failing (timing out), which is causing the container tasks to be bounced up and down.

Current configuration:

1 VPC ( 10.0.0.0/19 )
1 Internet gateway
3 private subnets, one for each AZ in eu-west-1 (10.0.0.0/24, 10.0.1.0/24, 10.0.2.0/24)
3 public subnets, one for each AZ in eu-west-1 (10.0.10.0/24, 10.0.11.0/24, 10.0.12.0/24)
3 NAT instances, one in each of the public subnets, routing 0.0.0.0/0 to the Internet gateway and each assigned an Elastic IP
3 ECS instances, again one in each private subnet with a route to the NAT instance in the corresponding public subnet in the same AZ as the ECS instance
1 ALB load balancer (Internet facing) which is registered with my 3 public subnets
1 Target group (with no instances registered as per ECS documentation) but a health check set up on the 'traffic' port at /health
1 Service bringing up 3 tasks spread across AZs and using dynamic ports (which are then mapped to 5000 in the docker container)
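As a quick sanity check on the addressing above, here is a minimal sketch (Python standard library only) that verifies every subnet falls inside the VPC range and that none of them overlap. The CIDRs are the ones listed in the question; the function name check_layout is just a label chosen for this example.

```python
import ipaddress

# CIDR blocks copied from the configuration listed above.
VPC = ipaddress.ip_network("10.0.0.0/19")
PRIVATE = ["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24"]
PUBLIC = ["10.0.10.0/24", "10.0.11.0/24", "10.0.12.0/24"]


def check_layout(vpc, cidrs):
    """Return (all_inside_vpc, any_overlap) for the given subnet CIDRs."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    all_inside = all(n.subnet_of(vpc) for n in nets)
    any_overlap = any(
        a.overlaps(b) for i, a in enumerate(nets) for b in nets[i + 1:]
    )
    return all_inside, any_overlap


if __name__ == "__main__":
    inside, overlap = check_layout(VPC, PRIVATE + PUBLIC)
    print("all subnets inside VPC:", inside)
    print("any subnets overlapping:", overlap)
```

A /19 starting at 10.0.0.0 spans 10.0.0.0 to 10.0.31.255, so all six /24 subnets fit, which suggests the addressing itself is not the culprit.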

Routing

Each private subnet has a rule to 10.0.0.0/19, and a default route for 0.0.0.0/0 to the NAT instance in public subnet in the same AZ as it.

Each public subnet has the same 10.0.0.0/19 route and a default route for 0.0.0.0/0 to the internet gateway.

Security groups

My instances are in a group that allows egress to anywhere and ingress on ports 32768 - 65535 from the security group the ALB is in.

The ALB is in a security group that allows ingress on port 80 only, but allows egress on any port/protocol to the security group my ECS instances are in.
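To reason about whether these rules let the health check through, here is a toy model of the two security groups. It is not an AWS API, just a sketch of the rule logic. The group names "alb-sg" and "ecs-sg" are hypothetical placeholders, and the ALB egress rule is modelled as ports 0 to 65535 to stand in for "any port".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A simplified security group rule: a port range plus a peer group."""
    from_port: int
    to_port: int
    peer: str  # hypothetical group name, e.g. "alb-sg" or "ecs-sg"

    def allows(self, port, peer):
        return self.from_port <= port <= self.to_port and self.peer == peer


# Instances: ingress on 32768-65535 from the ALB's group (as described above).
INSTANCE_INGRESS = [Rule(32768, 65535, "alb-sg")]
# ALB: egress to the instances' group on any port (modelled as 0-65535).
ALB_EGRESS = [Rule(0, 65535, "ecs-sg")]


def health_check_permitted(host_port):
    """True if the ALB may send to host_port and the instance may accept it."""
    leaves_alb = any(r.allows(host_port, "ecs-sg") for r in ALB_EGRESS)
    enters_instance = any(r.allows(host_port, "alb-sg") for r in INSTANCE_INGRESS)
    return leaves_alb and enters_instance


if __name__ == "__main__":
    for port in (5000, 32768, 49153, 65535):
        print(port, health_check_permitted(port))
```

With dynamic port mapping the ALB talks to the host port in the 32768-65535 range rather than to container port 5000 directly, so under this model the rules as described should allow the health check traffic.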

What happens

When I bring all this up, it actually works: I can take the public DNS record of the ALB, refresh, and see responses coming back from my container app telling me the hostname. This is exactly what I want to achieve. However, it fails the health check and the container is drained and replaced with another one that also fails the health check. This continues in a cycle; I have never seen a single successful health check.

What I've tried

Tweaked the health check intervals to make ECS require about 5 minutes of solid failed health-checks before killing the task. I thought this would eliminate it being a bit sensitive when the task starts up? This still goes on to trigger the tear-down, despite me being able to view the application running in my browser throughout.
Confirmed the /health URL endpoint in a number of ways. I can retrieve it publicly via the ALB (as well as view the main app root URL at '/') and curl tells me it has a proper 200 OK response (which the health check is set to look for by default). I have ssh'ed into my ECS instances and performed a curl --head {url} on '/' and '/health' and both give a 200 OK response. I've even spun up another instance in the public subnet, granted it the same access to my instances as the ALB security group has, and been able to curl the health check from there.

Summary

I can view my application correctly load-balanced across AZs and private subnets on both its main URL '/' and its health check URL '/health' through the load balancer, from the ECS instance itself, and by using the instance's private IP and port from another machine within the public subnet the ALB is in. The ECS service just cannot complete this health check even once without timing out. What on earth could I be missing?

Answer

For any that follow: I managed to break the app in my container accidentally and it started throwing a 500 error. Crucially, the health check started reporting this 500 error, so the original failure was NOT a network timeout. That means the health check was reaching the endpoint in my app all along, but the response was not being handled properly. This appears to be a problem related to Nancy (the API framework I was using) and Go, which sometimes reports "Client.Timeout exceeded while awaiting headers", and I am fairly sure ECS is interpreting this as a network timeout. I'm going to tcpdump the network traffic to see what the health check sends and how Nancy responds, and compare that to a container that works. Perhaps there is a Nancy fix, or maybe ECS needs to be less fussy.
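The reasoning above hinges on telling a timeout apart from an HTTP error status. Here is a minimal sketch of a probe that makes that distinction, using only the Python standard library. It is not how the ALB or ECS actually implements health checks. The function name probe and the result labels are made up for this example, and treating only 200 as healthy is an assumption based on the default mentioned in the question.

```python
import socket
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Classify one health-check style GET.

    Returns (label, detail) where label is one of:
      "healthy"          -> got a 200
      "unhealthy_status" -> got a response, but not a 200 (detail = status code)
      "timeout"          -> no complete response headers within `timeout`
      "unreachable"      -> connection refused, DNS failure, and so on
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            label = "healthy" if resp.status == 200 else "unhealthy_status"
            return (label, resp.status)
    except urllib.error.HTTPError as exc:
        # The server answered (e.g. with a 500), so the network path works.
        return ("unhealthy_status", exc.code)
    except (socket.timeout, TimeoutError):
        # Connected, but the response did not arrive in time.
        return ("timeout", None)
    except urllib.error.URLError as exc:
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return ("timeout", None)
        return ("unreachable", str(exc.reason))


if __name__ == "__main__":
    # Hypothetical address: substitute an instance's private IP and host port.
    print(probe("http://10.0.0.10:32768/health", timeout=5.0))
```

If a probe like this reports "unhealthy_status" with a 500, the request is reaching the app, which is the same inference drawn above. A "timeout" while curl succeeds against the same URL would point at how the server handles that particular client's request rather than at routing or security groups.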

edit:

Simply updating all the NuGet packages that my Nancy app was using to the latest available versions fixed it, and suddenly everything started working!