RabbitMQ cluster is not reconnecting after network failure

Question 1

RabbitMQ cluster is not reconnecting after network failure

cluster-computing rabbitmq

Ranch · Dec 28, 2011 · Viewed 27.3k times · Source

Answer

Answer

RabbitMQ Clusters do not work well on unreliable networks (part of RabbitMQ documentation). So when the network failure happens (in a two node cluster) each node thinks that it is the master and the only node in the cluster. Two master nodes don't automatically reconnect, because their states are not automatically synchronized (even in case of a RabbitMQ slave - the actual message synchronization does not happen - the slave just "catches up" as messages get consumed from the queue and more messages get added).

To detect whether you have a broken cluster, run the command:

rabbitmqctl cluster_status

on each of the nodes that form part of the cluster. If the cluster is broken then you'll only see one node. Something like:

Cluster status of node rabbit@rabbitmq1 ...
[{nodes,[{disc,[rabbit@rabbitmq1]}]},{running_nodes,[rabbit@rabbitmq1]}]
...done.

In such cases, you'll need to run the following set of commands on one of the nodes that formed part of the original cluster (so that it joins the other master node (say rabbitmq1) in the cluster as a slave):

rabbitmqctl stop_app

rabbitmqctl reset

rabbitmqctl join_cluster rabbit@rabbitmq1

rabbitmqctl start_app

Finally check the cluster status again .. this time you should see both the nodes.

Note: If you have the RabbitMQ nodes in an HA configuration using a Virtual IP (and the clients are connecting to RabbitMQ using this virtual IP), then the node that should be made the master should be the one that has the Virtual IP.

Question 2

I have a RabbitMQ cluster with two nodes in production and the cluster is breaking with these error messages:

=ERROR REPORT==== 23-Dec-2011::04:21:34 ===
** Node rabbit@rabbitmq02 not responding **
** Removing (timedout) connection **

=INFO REPORT==== 23-Dec-2011::04:21:35 ===
node rabbit@rabbitmq02 lost 'rabbit'

=ERROR REPORT==== 23-Dec-2011::04:21:49 ===
Mnesia(rabbit@rabbitmq01): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, rabbit@rabbitmq02}

I tried to simulate the problem by killing the connection between the two nodes using "tcpkill". The cluster has disconnected, and surprisingly the two nodes are not trying to reconnect!

When the cluster breaks, HAProxy load balancer still marks both nodes as active and send requests to both of them, although they are not in a cluster.

My questions:

If the nodes are configured to work as a cluster, when I get a network failure, why aren't they trying to reconnect afterwards?
How can I identify broken cluster and shutdown one of the nodes? I have consistency problems when working with the two nodes separately.

RabbitMQ cluster is not reconnecting after network failure

Answer

Related questions