Recovering from Redis connectivity loss

Konstantin Surkov picture Konstantin Surkov · Jul 24, 2014 · Viewed 9.4k times · Source

I am looking for the reference implementation of the recovery [in multi-threaded environment] after Redis connection has been lost. Was not able to find anything that makes sense so far.

Setup: I have an Azure worker role that runs the same code (ThreadProc) in multiple threads. Initially, I had static ConnectionMultiplexer and did .GetDatabase() before every Redis operation. That didn't pass stress testing at all (tons of "can't connect" errors once the load increased from low to moderate). I have changed it to this:

static readonly ConnectionMultiplexer _connection = ConnectionMultiplexer.Connect(...);
static readonly IDatabase _cache = _connection.GetDatabase();

void ThreadProc() // running in multiple threads
{
    while (true)
    {
      // using _cache here
    }
}

That worked pretty well even under high load (1000+ op/s per worker role instance) until I got "No connection is available to service this operation", and the thing can't recover since.

Please let me know what would be the correct/recommended code that can recover from intermittent connectivity problems.

Answer

Ken Smith picture Ken Smith · Aug 5, 2014

Edit 2015-05-02: Although the later versions of the StackExchange.Redis client are explicitly supposed to handle this sort of "lost connection" -> "reconnect" logic internally and automagically, my testing has shown beyond any real doubt that they don't manage to pull it off successfully, and hence this sort of thing is still needed, at least in busy environments. I yanked the code below out of my caching layer for a while, and ended up with tens of thousands of connection failure errors. I put it back in, and those all went away.

Edit 2015-02-24: This method should no longer be necessary. The latest versions of the StackExchange.Redis client correctly handle being disconnected - they'll reconnect themselves up automatically, and the workaround below just interferes with things. Keeping it here for historical purposes, but my recommendation is to ignore it.


Here's a few methods from my SimpleCacheRedis<T> wrapper that show how I'm handling the problem:

public async Task<TValue> GetAsync(string key, Func<Task<TValue>> missingFunc)
{
    key = GetKey(key);
    var value = default(TValue);
    try
    {
        var db = _connection.GetDatabase();
        var str = await db.StringGetAsync(key);
        if (!str.IsNullOrEmpty)
        {
            value = _jsonSerializer.Deserialize<TValue>(str);
        }
    }
    catch (RedisConnectionException ex)
    {
        HandleRedisConnectionError(ex);
    }
    catch (Exception ex)
    {
        _logger.Error("Error retrieving item '" + key +
                      "' from Redis cache; falling back to missingFunc(). Error = " + ex);
    }
    if (value == default(TValue))
    {
        present = false;
        value = await missingFunc();
        await PerformAddAsync(key, value);
    }
    return value;
}

private void HandleRedisConnectionError(RedisConnectionException ex)
{
    _logger.Error("Connection error with Redis cache; recreating connection for the next try, and falling back to missingFunc() for this one. Error = " + ex.Message);
    Task.Run(async () =>
    {
        try
        {
            await CreateConnectionAsync();
        }
        catch (Exception genEx)
        {
            _logger.Error("Unable to recreate redis connection (sigh); bailing for now: " + genEx.Message);
        }
    });
}

private async Task CreateConnectionAsync()
{
    if (_attemptingToConnect) return;
    var sw = new StringWriter();
    try
    {
        _attemptingToConnect = true;
        _connection = await ConnectionMultiplexer.ConnectAsync(_redisCs, sw);
    }
    catch (Exception ex)
    {
        _logger.Error("Unable to connect to redis async: " + ex);
        _logger.Debug("internal log: \r\n" + sw);
        throw;
    }
    finally
    {
        _attemptingToConnect = false;
    }
}

Basically, if I see that I can't connect to Redis because of a RedisConnectionException, I spin off a separate async task to recreate the shared connection. That call, of course, may fail, but calls were going to be failing during that period anyway. And as soon as it succeeds, any new calls will use that newly (re-)created connection. Like I said above, kinda boring.

My situation may be a little different from yours, in that I'm not using Redis as a permanent store, but just as a cache. That means the only impact from a lost redis connection is that I need to retrieve the result from the DB rather than from cache. So I can afford to treat certain things a little loosely.