How do I use ELB's HealthyHostCount for monitoring in CloudWatch?

awendt · Jul 23, 2012

We have three EC2 instances, one in each availability zone (AZ) of the eu-west-1 region. They are load-balanced using ELB. We'd like to use CloudWatch to monitor how many instances are registered with the load balancer. The problem is: I don't really understand the HealthyHostCount metric.

For a deployment, we'd like to be able to de-register a single instance (take it out of the LB) without being notified. So the alarm would be: Notify if there is only 1 healthy instance left behind the load balancer for 5 minutes.

As far as I understand, HealthyHostCount (HHC) is the number of healthy instances that are registered with a given ELB, averaged over all AZs. If everything is okay, the HHC should be 1 (no matter over what period of time) because there is 1 instance in each AZ.

A couple of days ago, someone deployed without re-registering the instances, so only 1 instance was being balanced. When we noticed that, we created an alarm that was supposed to notify us when the average HHC stayed below 0.6 for 5 minutes. (If only 1 instance is registered with the ELB, the HHC should average 0.33 over any period of time.) However, the alarm never changed to state "ALARM."

When I checked the HHC in CloudWatch, the values were numbers that didn't make sense (a sum of 10.0 for a 5-minute interval is all I remember now).

It's all a big mess to me. Every time I think I understand the metric, the CloudWatch charts turn out to be gibberish to me.

Could someone please explain how to use HHC to get an alarm when only 1 instance is registered? Is average HHC the way to go or should I use another metric?

Answer

Gerardo Grignoli · May 9, 2014

The HealthyHostCount metric records one data value per availability zone, holding the count of available hosts in that zone, each time a health check is executed. Your ELB health check has an Interval parameter (in seconds) that determines how many health checks are executed per minute.

If you are watching a Per-AZ metric, with a health check Interval of 10 seconds, with 2 healthy hosts in that AZ, you will see 6 data points per minute (60/10) with a value of 2. The average, max and min will be 2, but the sum will be 6*2=12.

If you have 3 AZs with 2 hosts each, again with an Interval of 10 seconds, but you are looking at the Per-LB metric, you will see 3*6=18 data points per minute, each one with a value of 2. The average, max and min will be 2, but the sum will be 18*2=36.
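The arithmetic in the two examples above can be checked with a short sketch. It assumes exactly one data point per AZ per health check, as described, and no extra ELB nodes. The function name `hhc_stats` and its parameters are made up for illustration and are not part of any AWS API.

```python
def hhc_stats(healthy_per_az, interval_seconds, period_seconds=60):
    """Simulate the HealthyHostCount data points of one CloudWatch period.

    healthy_per_az: number of healthy hosts in each AZ, one entry per AZ.
    Assumption: every health check emits one data point per AZ.
    """
    checks = period_seconds // interval_seconds
    # Each AZ contributes `checks` data points, each holding its host count.
    points = [count for count in healthy_per_az for _ in range(checks)]
    return {
        "samples": len(points),
        "sum": sum(points),
        "average": sum(points) / len(points),
        "min": min(points),
        "max": max(points),
    }


# Per-AZ view: one AZ with 2 healthy hosts, Interval = 10 seconds.
per_az = hhc_stats([2], interval_seconds=10)
# Per-LB view: three AZs with 2 healthy hosts each, Interval = 10 seconds.
per_lb = hhc_stats([2, 2, 2], interval_seconds=10)

print(per_az)
print(per_lb)
```

Under this model the average stays at 2 in both views, while the sum grows with the number of data points. That is why the sum only makes sense once you know how many data points to expect per period.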

I recommend choosing an interval that divides 60 seconds evenly (for example 5, 6, 10, 15, 20, 30 or 60 seconds).

In your case, if your interval is 30 seconds and you have 3 AZs with 1 server per AZ, you should expect 2 data points per AZ per minute. So set up a Per-LB alarm with a Period of 1 minute on the Sum of HealthyHostCount that triggers when the value is less than or equal to 2 (2 data values * 1 healthy AZ * 1 healthy server = 2; the other 4 data values, from the unhealthy AZs, should be 0, so they won't affect the sum).
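As a rough sketch, the alarm described above could be expressed as parameters for CloudWatch's PutMetricAlarm call. Everything named here is an assumption for illustration: the helper `build_hhc_sum_alarm`, the alarm name, the load balancer name and the SNS topic ARN are hypothetical, and the 5 evaluation periods come from the "for 5 minutes" requirement in the question rather than from this answer.

```python
def build_hhc_sum_alarm(load_balancer_name, sns_topic_arn):
    """Build keyword arguments for a Per-LB Sum alarm on HealthyHostCount.

    Assumes a 30-second health check Interval, 3 AZs and 1 server per AZ,
    so that a sum of 2 per minute means only one healthy server is left.
    """
    return {
        "AlarmName": f"{load_balancer_name}-one-healthy-host-left",  # hypothetical
        "Namespace": "AWS/ELB",
        "MetricName": "HealthyHostCount",
        # Only the LoadBalancerName dimension: this is the Per-LB view.
        "Dimensions": [{"Name": "LoadBalancerName", "Value": load_balancer_name}],
        "Statistic": "Sum",
        "Period": 60,             # 1 minute, as recommended above
        "EvaluationPeriods": 5,   # assumption: 5 consecutive minutes
        "Threshold": 2.0,
        "ComparisonOperator": "LessThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],
    }


params = build_hhc_sum_alarm(
    "my-elb",                                          # hypothetical ELB name
    "arn:aws:sns:eu-west-1:123456789012:ops-alerts",   # hypothetical topic
)
print(params)

# To create the alarm for real (needs boto3 and AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the parameters in a plain dictionary makes it easy to review the threshold and period before anything is sent to AWS.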

UPDATE:

It turns out that the number of health checks executed also depends on the number of internal instances that make up the ELB (usually one per AZ). If you are hit by a traffic spike, or by enough load to saturate a single internal ELB instance, the number of internal servers inside the ELB will grow and you will unexpectedly get more data points. This can affect the sum, but only if you have lots of traffic; I didn't see this issue with a peak load of 6k RPM distributed across 3 AZs. If this is your scenario, then using the average is a safer bet, and I would recommend a threshold of less than 0.65 (LessThanThreshold).
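To see why the average is less sensitive than the sum, here is a small simulation. It is only a sketch under a simplifying assumption: every internal ELB node reports the same per-AZ counts, and the number of nodes grows evenly across AZs. The function `simulate_hhc` and its parameters are hypothetical.

```python
THRESHOLD = 0.65  # average threshold suggested above


def simulate_hhc(healthy_per_az, nodes_per_az=1, checks_per_period=2):
    """Return (sum, average) of the simulated data points for one period.

    Assumption: each internal ELB node emits one data point per AZ per
    health check, so more nodes means more (identical) data points.
    """
    points = []
    for count in healthy_per_az:
        points.extend([count] * nodes_per_az * checks_per_period)
    return sum(points), sum(points) / len(points)


# 3 AZs, 1 server per AZ, Interval = 30 s (2 checks per minute).
for healthy in ([1, 1, 1], [1, 1, 0], [1, 0, 0]):
    for nodes in (1, 3):
        total, average = simulate_hhc(healthy, nodes_per_az=nodes)
        print(healthy, "nodes per AZ:", nodes, "sum:", total,
              "average:", round(average, 2), "alarm:", average < THRESHOLD)
```

With only one healthy server, the sum goes from 2 to 6 when the ELB triples its internal nodes, so a "sum at most 2" alarm would stay silent, while the average remains around 0.33 and still falls below 0.65. With two healthy servers the average is about 0.67, which stays above the threshold, so de-registering a single instance does not trigger the alarm.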

The link also makes me wonder how the Cross-Zone Load Balancing feature affects the number of data points...