Multi-AZ RDS test failover and connection monitoring

amparito · Mar 8, 2017 · Viewed 7.2k times

My question has two parts:

  1. What is the best way to initiate an RDS failover for testing purposes?
  2. How can I monitor the connection during failover in order to observe the time that it takes for AWS to reconnect the user to the standby instance?

With respect to part (1): If I understand correctly, all instance modifications are made on the standby and then AWS fails over by flipping the CNAME over to the standby as the primary is updated, so if I were to make any kind of instance modification and select "apply immediately," it should cause a failover, correct?

With respect to part (2): I am looking specifically for a way of monitoring the failover of an Oracle RDS instance, whether through a Lambda function, a bash script, or some other means. As far as I can tell, it is not possible to use ping with RDS, even when I allow all ICMP traffic via the security group. I can connect without trouble using telnet or an SQL client. What I would like is a way to probe the database periodically during a failover, so I can see when the IP associated with the connection string switches over and how long the switch takes. Any suggestions?

Answer

Anthony Neace · Mar 8, 2017
  1. Correct, RDS will make your modifications on the standby instance and then fail over to it. Per their documentation:

The availability benefits of Multi-AZ deployments also extend to planned maintenance and backups. In the case of system upgrades like OS patching or DB Instance scaling, these operations are applied first on the standby, prior to the automatic failover. As a result, your availability impact is, again, only the time required for automatic failover to complete.

To simulate a failover, choose "reboot with failover" when rebooting the instance, rather than performing a plain reboot. From the linked documentation:

Reboot with failover is beneficial when you want to simulate a failure of a DB instance for testing, or restore operations to the original AZ after a failover occurs.
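The same reboot can be triggered from a script instead of the console. The following is a minimal sketch, assuming the boto3 SDK and its `reboot_db_instance` call with `ForceFailover=True`; the helper name `reboot_with_failover` and the instance identifier used in the note below are illustrative and not from the original answer.

```python
def reboot_with_failover(rds_client, instance_id):
    """Reboot a Multi-AZ DB instance and force a failover to the standby.

    rds_client is expected to be a boto3 RDS client (an assumption of this
    sketch); instance_id is the DB instance identifier shown in the console.
    """
    response = rds_client.reboot_db_instance(
        DBInstanceIdentifier=instance_id,
        ForceFailover=True,  # rejected by RDS if the instance is not Multi-AZ
    )
    # The API returns a description of the instance; right after the call
    # the status is normally "rebooting".
    return response["DBInstance"]["DBInstanceStatus"]
```

With boto3 installed and credentials configured, this could be called as `reboot_with_failover(boto3.client("rds"), "my-test-oracle-db")`, where the identifier is a placeholder. Passing the client in as a parameter keeps the helper easy to exercise against a stub before pointing it at a real instance.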

  2. Write a script that, at a regular interval, connects with a SQL client and runs a quick SELECT against a table of your choice. You can use this to measure true downtime during the failover. We have a very similar tool that we use to estimate the impact of modifications on a test RDS instance before we apply them to our production RDS instance. Our tool simply writes a timestamp to the console every few seconds, along with whether the attempt succeeded or failed. It reports success before the reboot, failure during it, and success again after the cutover completes.
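A rough sketch of such a monitor is shown here. It is not the tool described above: to stay within the standard library it swaps the SQL query for a plain TCP connect to the listener port (the question notes that telnet works where ping does not), and it also logs the IP the endpoint currently resolves to. The function names, the 2-second interval, and the 5-minute duration are all assumptions. A TCP connect only shows that the port is reachable, so a real SELECT through an Oracle driver would give a truer measure of downtime.

```python
import socket
import time
from datetime import datetime, timezone


def probe(host, port, timeout=2.0):
    """Resolve host, then try a TCP connect. Returns (ip, ok)."""
    try:
        ip = socket.gethostbyname(host)  # re-resolved on every probe
    except OSError:
        return None, False
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return ip, True
    except OSError:
        return ip, False


def downtime_seconds(samples):
    """Seconds from the first failed sample to the next success, or None."""
    first_fail = None
    for ts, ok in samples:
        if not ok and first_fail is None:
            first_fail = ts
        elif ok and first_fail is not None:
            return ts - first_fail
    return None


def monitor(host, port, interval=2.0, duration=300.0):
    """Probe every `interval` seconds for `duration` seconds, logging each try."""
    samples = []
    end = time.monotonic() + duration
    while time.monotonic() < end:
        ip, ok = probe(host, port)
        now = time.time()
        stamp = datetime.fromtimestamp(now, timezone.utc).isoformat(timespec="seconds")
        print(f"{stamp} ip={ip} {'success' if ok else 'FAILURE'}", flush=True)
        samples.append((now, ok))
        time.sleep(interval)
    return samples
```

Calling `monitor("<your-rds-endpoint>", 1521)` just before triggering the failover and then passing the result to `downtime_seconds` gives an estimate whose resolution is limited by the probe interval and timeout. The logged IP may lag behind the real switch if a resolver between the script and AWS caches the old record.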
