Redis failover with StackExchange / Sentinel from C#

Paul · Aug 19, 2014 · Viewed 8.4k times

We're currently using Redis 2.8.4 and StackExchange.Redis (and loving it), but we don't have any protection against hardware failures etc. at the moment. I'm trying to get a solution working with master/slaves and Sentinel monitoring, but I can't quite get there and I'm unable to find any real pointers after searching.

So currently we have got this far:

We have 3 redis servers with a sentinel on each node (set up by the Linux guys):

  • devredis01:6383 (master)
  • devredis02:6383 (slave)
  • devredis03:6383 (slave)
  • devredis01:26379 (sentinel)
  • devredis02:26379 (sentinel)
  • devredis03:26379 (sentinel)

I am able to connect the StackExchange.Redis client to the redis servers, write/read, and verify that the data is being replicated across all redis instances using Redis Desktop Manager.
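For reference, a minimal sketch of that kind of connection in C# (the option values and key name are illustrative assumptions, not necessarily what we use):

    using System;
    using StackExchange.Redis;

    class RedisConnectDemo
    {
        static void Main()
        {
            // Connect to all three redis data nodes; the client works out which
            // one is the master and sends writes there.
            var options = new ConfigurationOptions
            {
                EndPoints = { "devredis01:6383", "devredis02:6383", "devredis03:6383" },
                AbortOnConnectFail = false   // keep retrying rather than throwing at startup (assumption)
            };

            using (var redis = ConnectionMultiplexer.Connect(options))
            {
                IDatabase db = redis.GetDatabase();
                db.StringSet("failover:test", "hello");
                Console.WriteLine((string)db.StringGet("failover:test"));
            }
        }
    }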

I can also connect to the sentinel services using a different ConnectionMultiplexer, query the config, ask for master redis node, ask for slaves etc.
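A minimal sketch of that sentinel connection and the queries (the "mymaster" service name is an assumption; it is whatever name the sentinels were configured to monitor):

    using System;
    using System.Net;
    using StackExchange.Redis;

    class SentinelQueryDemo
    {
        static void Main()
        {
            // Sentinels only understand a small command set, so use the sentinel command map.
            var sentinelOptions = new ConfigurationOptions
            {
                EndPoints = { "devredis01:26379", "devredis02:26379", "devredis03:26379" },
                CommandMap = CommandMap.Sentinel,
                TieBreaker = "",              // sentinels have no tie-breaker key
                AbortOnConnectFail = false
            };

            using (var sentinel = ConnectionMultiplexer.Connect(sentinelOptions))
            {
                IServer server = sentinel.GetServer("devredis01", 26379);

                // Ask who the current master is.
                EndPoint master = server.SentinelGetMasterAddressByName("mymaster");
                Console.WriteLine("Current master: {0}", master);

                // Each slave comes back as an array of name/value pairs.
                foreach (var slave in server.SentinelSlaves("mymaster"))
                    foreach (var pair in slave)
                        Console.WriteLine("{0} = {1}", pair.Key, pair.Value);
            }
        }
    }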

We can also kill the master redis node and verify that one of the slaves is promoted to master and replication to the other slave continues to work. We can observe the redis connection trying to reconnect to the master, and also if I recreate the ConnectionMultiplexer I can write/read again to the newly promoted master and read from the slave.

So far so good!

The bit I'm missing is how do you bring it all together in a production system?

Should I be getting the redis endpoints from sentinel and using 2 ConnectionMultiplexers? What exactly do I have to do to detect that a node has gone down? Can StackExchange.Redis do this for me automatically, or does it raise an event so I can reconnect my redis ConnectionMultiplexer? Should I handle the ConnectionFailed event and then reconnect in order for the ConnectionMultiplexer to find out what the new master is? Presumably while I am reconnecting any attempts to write will be lost?

I hope I'm not missing something very obvious here; I'm just struggling to put it all together.

Thanks in advance!

Answer

Paul · Sep 24, 2014

I was able to spend some time last week with the Linux guys testing scenarios and working on the C# side of this implementation, and I am using the following approach (a rough code sketch follows the list):

  • Read the sentinel addresses from config and create a ConnectionMultiplexer to connect to them.
  • Subscribe to the +switch-master channel.
  • Ask each sentinel server in turn what it thinks the master redis and slaves are, and compare the answers to make sure they all agree.
  • Create a new ConnectionMultiplexer with the redis server addresses read from sentinel and connect, adding event handlers for ConnectionFailed and ConnectionRestored.
  • When I receive the +switch-master message I call Configure() on the redis ConnectionMultiplexer.
  • As a belt-and-braces approach I always call Configure() on the redis ConnectionMultiplexer 12 seconds after receiving a ConnectionFailed or ConnectionRestored event when the connection type is ConnectionType.Interactive.
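To make the wiring concrete, here is a rough sketch of that approach in C#. It is a simplified illustration rather than the exact production code: the "mymaster" service name is whatever the sentinels monitor, the slave endpoints are hard-coded here instead of being read via SentinelSlaves, and only one sentinel is queried instead of all three.

    using System;
    using System.Net;
    using System.Threading.Tasks;
    using StackExchange.Redis;

    class RedisFailoverClient
    {
        // Assumed value: the monitored service name comes from sentinel.conf / app config.
        const string ServiceName = "mymaster";

        readonly ConnectionMultiplexer _sentinel;
        readonly ConnectionMultiplexer _redis;

        public RedisFailoverClient()
        {
            // Connect to the sentinels; they only speak the sentinel command set.
            var sentinelOptions = new ConfigurationOptions
            {
                EndPoints = { "devredis01:26379", "devredis02:26379", "devredis03:26379" },
                CommandMap = CommandMap.Sentinel,
                TieBreaker = "",
                AbortOnConnectFail = false
            };
            _sentinel = ConnectionMultiplexer.Connect(sentinelOptions);

            // Ask a sentinel for the current master (in reality: ask each one in turn,
            // check they agree, and discover the slaves via SentinelSlaves too).
            IServer sentinelServer = _sentinel.GetServer("devredis01", 26379);
            EndPoint master = sentinelServer.SentinelGetMasterAddressByName(ServiceName);

            // Connect the "real" redis multiplexer to the discovered endpoints.
            var redisOptions = new ConfigurationOptions { AbortOnConnectFail = false };
            redisOptions.EndPoints.Add(master);
            redisOptions.EndPoints.Add("devredis02:6383");
            redisOptions.EndPoints.Add("devredis03:6383");
            _redis = ConnectionMultiplexer.Connect(redisOptions);

            _redis.ConnectionFailed += OnConnectionStateChanged;
            _redis.ConnectionRestored += OnConnectionStateChanged;

            // Subscribe to +switch-master so failovers are picked up straight away.
            _sentinel.GetSubscriber().Subscribe("+switch-master", (channel, message) =>
            {
                Console.WriteLine("Sentinel reports a new master: {0}", message);
                _redis.Configure();
            });
        }

        // Belt and braces: reconfigure 12 seconds after any interactive connection
        // failure or restoration, in case the +switch-master message was missed.
        void OnConnectionStateChanged(object sender, ConnectionFailedEventArgs e)
        {
            if (e.ConnectionType != ConnectionType.Interactive) return;
            Task.Delay(TimeSpan.FromSeconds(12)).ContinueWith(_ => _redis.Configure());
        }

        public IDatabase Database
        {
            get { return _redis.GetDatabase(); }
        }
    }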

I find that, generally, I am reconfigured and working again about 5 seconds after losing the redis master. During this time I can't write, but I can read (since reads can be served by a slave). 5 seconds is OK for us since our data updates very quickly and becomes stale after a few seconds (and is subsequently overwritten).
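For completeness, the reads during that window work because a read can be flagged as happy to go to a slave. A minimal sketch (the key name is illustrative; CommandFlags.PreferSlave is the flag name in client versions of this era):

    using System;
    using StackExchange.Redis;

    class ReadFromSlaveDemo
    {
        static void Main()
        {
            var redis = ConnectionMultiplexer.Connect("devredis01:6383,devredis02:6383,devredis03:6383");
            IDatabase db = redis.GetDatabase();

            // While the master is being replaced, writes fail, but a read flagged
            // with PreferSlave can still be served by one of the slaves.
            RedisValue value = db.StringGet("failover:test", flags: CommandFlags.PreferSlave);
            Console.WriteLine((string)value);
        }
    }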

One thing I wasn't sure about was whether or not I should remove the redis server from the redis ConnectionMultiplexer when an instance goes down, or let it continue to retry the connection. I decided to leave it retrying as it comes back into the mix as a slave as soon as it comes back up. I did some performance testing with and without a connection being retried and it seemed to make little difference. Maybe someone can clarify whether this is the correct approach.

Every now and then, bringing back an instance that was previously a master did seem to cause some confusion: a few seconds after it came back up I would receive a "READONLY" exception when writing, indicating I can't write to a slave. This was rare, but I found that my catch-all approach of calling Configure() 12 seconds after a connection state change caught this problem. Calling Configure() seems very cheap, so calling it twice regardless of whether or not it's necessary seemed OK.
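For illustration, one defensive pattern for that situation is to catch the READONLY error on a write, call Configure(), and retry once. This is a sketch of that idea rather than what is described above (the helper name and single-retry policy are assumptions):

    using System;
    using StackExchange.Redis;

    static class SafeWrite
    {
        // If a write lands on a node that has just been demoted to slave, redis
        // replies with a READONLY error. Reconfigure the multiplexer so it
        // re-discovers the master, then retry the write once.
        public static void StringSetWithRetry(ConnectionMultiplexer redis, RedisKey key, RedisValue value)
        {
            try
            {
                redis.GetDatabase().StringSet(key, value);
            }
            catch (RedisServerException ex)
            {
                if (!ex.Message.StartsWith("READONLY", StringComparison.Ordinal))
                    throw;

                redis.Configure();                          // re-discover which node is master
                redis.GetDatabase().StringSet(key, value);  // retry once
            }
        }
    }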

Now that I have slaves, I have offloaded some of my data cleanup code, which does key scans, to the slaves, which makes me happy.
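A minimal sketch of that offloading (the key pattern and page size are illustrative): pick a connected slave endpoint and run IServer.Keys against it, which uses SCAN on redis 2.8+ so it doesn't block the node.

    using System;
    using System.Linq;
    using StackExchange.Redis;

    class CleanupOnSlaveDemo
    {
        static void Main()
        {
            var redis = ConnectionMultiplexer.Connect("devredis01:6383,devredis02:6383,devredis03:6383");

            // Pick a connected slave so the scan doesn't add load to the master.
            IServer slave = redis.GetEndPoints()
                                 .Select(ep => redis.GetServer(ep))
                                 .First(s => s.IsConnected && s.IsSlave);

            foreach (RedisKey key in slave.Keys(pattern: "stale:*", pageSize: 250))
            {
                Console.WriteLine("candidate for cleanup: {0}", key);
                // Deletes still have to go through the master:
                // redis.GetDatabase().KeyDelete(key);
            }
        }
    }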

All in all I'm pretty satisfied; it's not perfect, but for something that should very rarely happen it's more than good enough.