We have the following setup:
Redis 2.6 on Ubuntu Linux 12.04LTE on a RackspaceCloud 8GB instance with the following settings:
daemonize yes
pidfile /var/run/redis_6379.pid
port 6379
timeout 300
loglevel notice
logfile /var/log/redis_6379.log
databases 16
save 900 1
save 300 10
save 60 10000
rdbcompression yes
dbfilename dump.rdb
dir /var/redis/6379
requirepass PASSWORD
maxclients 10000
maxmemory 7gb
maxmemory-policy allkeys-lru
maxmemory-samples 3
appendonly no
slowlog-log-slower-than 10000
slowlog-max-len 128
activerehashing yes
Our App servers are hosted in RackSpace Managed and connect to the Redis via public IP (to avoid having to set up RackSpace Connect, which is a royal PITA), and we provide some security by requiring a password for the Redis connection. I manually increased unix file descriptor limits to 10240, max of 10k connections should offer enough headroom. As you can see from the settings file above, I limit memory usage to 7GB to leave some RAM headroom as well.
We use the ServiceStack C# Redis Driver. We use the following web.config settings:
<RedisConfig suffix="">
<Primary password="PASSWORD" host="HOST" port="6379" maxReadPoolSize="50" maxWritePoolSize="50"/>
</RedisConfig>
We have a PooledRedisClientManager singleton, created once per AppPool as follows:
private static PooledRedisClientManager _clientManager;
public static PooledRedisClientManager ClientManager
{
get
{
if (_clientManager == null)
{
try
{
var poolConfig = new RedisClientManagerConfig
{
MaxReadPoolSize = RedisConfig.Config.Primary.MaxReadPoolSize,
MaxWritePoolSize = RedisConfig.Config.Primary.MaxWritePoolSize,
};
_clientManager = new PooledRedisClientManager(new List<string>() { RedisConfig.Config.Primary.ToHost() }, null, poolConfig);
}
catch (Exception e)
{
log.Fatal("Could not spin up Redis", e);
CacheFailed = DateTime.Now;
}
}
return _clientManager;
}
}
And we acquire a connection and do put/get operations as follows:
using (var client = ClientManager.GetClient())
{
client.Set<T>(region + key, value);
}
Code seems to mostly work. Given that we have ~20 AppPools and 50-100 read and 50-100 write clients we expect 2000-4000 connections to the Redis server at the most. However, we keep seeing the following exception in our error logs, usually a couple hundred of those bunched together, nothing for an hour, and over again, ad nauseum.
System.IO.IOException: Unable to read data from the transport connection:
An existing connection was forcibly closed by the remote host.
---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host at
System.Net.Sockets.Socket.Receive(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags) at
System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
--- End of inner exception stack trace
- at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size) at System.IO.BufferedStream.ReadByte() at
ServiceStack.Redis.RedisNativeClient.ReadLine() in C:\src\ServiceStack.Redis\src\ServiceStack.Redis\RedisNativeClient_Utils.cs:line 85 at
ServiceStack.Redis.RedisNativeClient.SendExpectData(Byte[][] cmdWithBinaryArgs) in C:\src\ServiceStack.Redis\src\ServiceStack.Redis\RedisNativeClient_Utils.cs:line 355 at
ServiceStack.Redis.RedisNativeClient.GetBytes(String key) in C:\src\ServiceStack.Redis\src\ServiceStack.Redis\RedisNativeClient.cs:line 404 at ServiceStack.Redis.RedisClient.GetValue(String key) in C:\src\ServiceStack.Redis\src\ServiceStack.Redis\RedisClient.cs:line 185 at ServiceStack.Redis.RedisClient.Get[T](String key) in C:\src\ServiceStack.Redis\src\ServiceStack.Redis\RedisClient.ICacheClient.cs:line 32 at DataPeaks.NoSQL.RedisCacheClient.Get[T](String key) in c:\dev\base\branches\currentversion\DataPeaks\DataPeaks.NoSQL\RedisCacheClient.cs:line 96
We have experimented with a Redis Server Timeout of 0 (i.e. NO connection timeout), a timeout of 24 hours, and in between, without luck. Googling and Stackoverflowing has brought no real answers, everything seems to point to us doing the right thing with the code at least.
Our feeling is that we get regular sustained network latency issues beetwen Rackspace Hosted and Rackspace Cloud, which cause a block of TCP connections to go stale. We could possibly solve that by implementing Client side connection timeouts, and the question would be whether we'd need server side timeouts as well. But that's just a feeling, and we're not 100% sure we're on the right track.
Ideas?
Edit: I occasionally see the following error as well:
ServiceStack.Redis.RedisException: Unable to Connect: sPort: 65025 ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host at System.Net.Sockets.Socket.Send(IList`1 buffers, SocketFlags socketFlags) at ServiceStack.Redis.RedisNativeClient.FlushSendBuffer() in C:\src\ServiceStack.Redis\src\ServiceStack.Redis\RedisNativeClient_Utils.cs:line 273 at ServiceStack.Redis.RedisNativeClient.SendCommand(Byte[][] cmdWithBinaryArgs) in C:\src\ServiceStack.Redis\src\ServiceStack.Redis\RedisNativeClient_Utils.cs:line 203 --- End of inner exception stack trace --- at ServiceStack.Redis.RedisNativeClient.CreateConnectionError() in C:\src\ServiceStack.Redis\src\ServiceStack.Redis\RedisNativeClient_Utils.cs:line 165 at ServiceStack.Redis.RedisNativeClient.SendExpectData(Byte[][] cmdWithBinaryArgs) in C:\src\ServiceStack.Redis\src\ServiceStack.Redis\RedisNativeClient_Utils.cs:line 355 at ServiceStack.Redis.RedisNativeClient.GetBytes(String key) in C:\src\ServiceStack.Redis\src\ServiceStack.Redis\RedisNativeClient.cs:line 404 at ServiceStack.Redis.RedisClient.GetValue(String key) in C:\src\ServiceStack.Redis\src\ServiceStack.Redis\RedisClient.cs:line 185 at ServiceStack.Redis.RedisClient.Get[T](String key) in C:\src\ServiceStack.Redis\src\ServiceStack.Redis\RedisClient.ICacheClient.cs:line 32 at DataPeaks.NoSQL.RedisCacheClient.Get[T](String key) in c:\dev\base\branches\currentversion\DataPeaks\DataPeaks.NoSQL\RedisCacheClient.cs:line 96
I imagine this is a direct result of having server-side connection timeouts that aren't handled on the client. It's looking like we really need to be handling client-side connection timeouts.
We think we found the root cause after carefully reading through the Redis documentation and finding this beauty (http://redis.io/topics/persistence):
RDB needs to fork() often in order to persist on disk using a child process.
Fork() can be time consuming if the dataset is big, and may result in Redis
to stop serving clients for some millisecond or even for one second if the
dataset is very big and the CPU performance not great. AOF also needs to fork()
but you can tune how often you want to rewrite your logs without any trade-off
on durability.
We turned RDB persistence off, and haven't seen those connection drops since.