How to set timeout detection on a RabbitMQ server?

Unknown · Aug 28, 2009

I am trying out RabbitMQ with this Python binding.

One thing I noticed is that if I kill a consumer uncleanly (emulating a crashed program), the server goes on treating that consumer as connected for a long time. As a result, every other message is ignored.

For example, if you kill a consumer once and reconnect, 1 in 2 messages will be ignored. If you kill another consumer, 2 in 3 messages will be ignored. If you kill a third, 3 in 4 messages will be ignored, and so on.
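Those fractions are what you would expect if the server keeps dispatching round-robin to every consumer it still believes is connected. Here is a toy simulation of that assumption. It does not talk to RabbitMQ at all, and the function name and parameters are made up for illustration.

```python
from fractions import Fraction
from itertools import cycle


def lost_fraction(dead, live=1, messages=1200):
    """Fraction of messages handed to dead consumers under round-robin.

    Assumption: the server still counts each killed consumer as present
    and gives every registered consumer an equal turn.
    """
    consumers = ["dead"] * dead + ["live"] * live
    lost = 0
    # Hand out `messages` messages, one per consumer in turn.
    for _, consumer in zip(range(messages), cycle(consumers)):
        if consumer == "dead":
            lost += 1
    return Fraction(lost, messages)


if __name__ == "__main__":
    for killed in (1, 2, 3):
        print(killed, "killed:", lost_fraction(killed), "of messages lost")
```

With one live consumer left, each additional stale consumer takes one more slot in the rotation, which matches the 1/2, 2/3, 3/4 pattern I am seeing.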

I've tried turning on acknowledgments, but it doesn't seem to help. The only solution I have found is to manually stop the server and reset it.

Is there a better way?

How to recreate this scenario

  • Run RabbitMQ.

  • Unarchive this library.

  • Download the consumer and publisher here. Run amqp_consumer.py twice. Run amqp_publisher.py, feeding in some data and observe that it works as expected. Messages are received round robin style.

  • Kill one of the consumer processes with kill -9 or task manager.

  • Now, when you publish messages, 50% of them are lost.

Answer

Tony Garnock-Jones · Sep 6, 2009

I don't see amqp_consumer.py or amqp_publisher.py in the tarball, so reproducing the fault is tricky.

RabbitMQ terminates connections, releasing their unacknowledged messages for redelivery to other clients, whenever it is told by the operating system that a socket has closed. Your symptoms are very strange, in that even a kill -9 ought to cause the TCP socket to be cleaned up properly.
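To make that behaviour concrete, here is a toy in-memory model of the bookkeeping involved. It is only a sketch of the idea, not RabbitMQ's actual implementation: the class `ToyBroker` and all of its method names are hypothetical.

```python
from collections import deque


class ToyBroker:
    """Illustrative model: unacked messages are tracked per connection
    and returned to the queue when that connection's socket closes."""

    def __init__(self):
        self.queue = deque()   # messages waiting for delivery
        self.unacked = {}      # connection id -> delivered but unacked

    def publish(self, message):
        self.queue.append(message)

    def deliver(self, conn_id):
        """Hand the next message to a connection and remember it."""
        if not self.queue:
            return None
        message = self.queue.popleft()
        self.unacked.setdefault(conn_id, []).append(message)
        return message

    def ack(self, conn_id, message):
        self.unacked[conn_id].remove(message)

    def on_socket_closed(self, conn_id):
        """Called when the OS reports the socket closed: requeue
        everything the connection never acknowledged, in order."""
        pending = self.unacked.pop(conn_id, [])
        self.queue.extendleft(reversed(pending))


if __name__ == "__main__":
    broker = ToyBroker()
    broker.publish("m1")
    broker.deliver("crashed-client")
    broker.on_socket_closed("crashed-client")
    print(broker.deliver("healthy-client"))
```

The point of the model is that `on_socket_closed` is the only trigger for redelivery. If the operating system never reports the close, the unacknowledged messages stay assigned to the dead connection.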

Some people have noticed problems with sockets surviving longer than they should when running with a firewall or NAT device between the AMQP clients and the server. Could that be an issue here, or are you running everything on localhost? Also, what operating system are you running the various components of the system on?

ETA: From your comment below, I am guessing that while you are running the server on Linux, you may be running the clients on Windows. If this is the case, then it could be that the Windows TCP driver is not closing the sockets correctly, which is different from the kill -9 behaviour on Unix. (On Unix, the kernel will properly close the TCP connections of any killed process.)

If that's the case, then the bad news is that RabbitMQ can only release resources when the socket is closed, so if the client operating system doesn't do that, there's nothing it can do. This is the same as almost every other TCP-based service out there.

The good news, though, is that AMQP supports a "heartbeat" option for exactly these cases, where the networking fabric is untrustworthy. You could try enabling heartbeats. When they're enabled, if the server doesn't receive any traffic within a configurable interval, it decides that the connection must be dead.
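As a rough illustration of the timeout logic described above, here is a minimal sketch. It is not taken from RabbitMQ or from any client library: `HeartbeatMonitor`, `on_traffic` and `is_dead` are hypothetical names, and the injectable clock is there only so the logic can be exercised without waiting.

```python
import time


class HeartbeatMonitor:
    """Declares a connection dead if no traffic arrives within a timeout."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.last_seen = clock()

    def on_traffic(self):
        """Call for every frame received, heartbeat or otherwise."""
        self.last_seen = self.clock()

    def is_dead(self):
        """True once the peer has been silent for longer than the timeout."""
        return self.clock() - self.last_seen > self.timeout_seconds


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_seconds=0.1)
    time.sleep(0.2)   # simulate a silent (crashed) client
    print("connection dead?", monitor.is_dead())
```

A server using this kind of check would poll `is_dead()` periodically and, when it returns true, close the connection itself, which in turn releases the unacknowledged messages just as a normal socket close would.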

The bad news, however, is that I don't think py-amqplib supports heartbeats at the moment. Worth a try, though!