A few questions about RabbitMQ v3.1.5 clustering. I have a two-node cluster, and rabbitmq.config looks like this on both nodes:
[
{rabbit, [
{cluster_nodes, {['rabbit@rmq01', 'rabbit@rmq02'], ram}},
{tcp_listeners, [5674]}
]}
].
I have seen this problem before, and now I am seeing it again: when the whole cluster shuts down and the second node (rmq02) starts before the first (rmq01), rmq02 'forgets' about rmq01:
[root@rmq2 rabbitmq]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@rmq2' ...
[{nodes,[{disc,['rabbit@rmq2']}]},
{running_nodes,['rabbit@rmq2']},
{partitions,[]}]
...done.
After this, the first node (rmq01) cannot start because rmq02 disagrees about the clustering:
{"init terminating in do_boot",{rabbit,failure_during_boot,{error,{inconsistent_cluster,"Node 'rabbit@rmq1' thinks it's clustered with node 'rabbit@rmq2', but 'rabbit@rmq2' disagrees"}}}}
I tried to re-join rmq02 to rmq01, but it seems I have to run stop_app first:
[root@rmq2 rabbitmq]# rabbitmqctl join_cluster rabbit@rmq1
Clustering node 'rabbit@rmq2' with 'rabbit@rmq1' ...
Error: mnesia_unexpectedly_running
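The mnesia_unexpectedly_running error suggests the RabbitMQ application was still running when join_cluster was called. A minimal sketch of the usual order (stop_app, join_cluster, start_app) follows. The function name rejoin_cluster and the RABBITMQCTL override variable are my own, added only so the sequence can be dry-run. It assumes the target node is up and reachable, which is not the case here while rmq01 fails to boot, and as far as I know join_cluster resets the joining node, so treat it as an illustration rather than a fix.

```shell
# RABBITMQCTL is a hypothetical override; set it to "echo" for a dry run.
RABBITMQCTL=${RABBITMQCTL:-rabbitmqctl}

# rejoin_cluster TARGET: stop the RabbitMQ application on the local node,
# join it to TARGET, then start the application again.
# Stops at the first failing step.
rejoin_cluster() {
    target=$1
    $RABBITMQCTL stop_app &&
    $RABBITMQCTL join_cluster "$target" &&
    $RABBITMQCTL start_app
}

# Example (run on the node that should join):
#   rejoin_cluster rabbit@rmq1
```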
Here I can see that rmq02 has forgotten about rmq01:
[root@rmq2 ~]# cat /var/lib/rabbitmq/mnesia/rabbit\@rmq2/cluster_nodes.config
{['rabbit@rmq2'],['rabbit@rmq2']}.
Meanwhile, on rmq01 the configuration is correct:
[root@rmq1 ~]# cat /var/lib/rabbitmq/mnesia/rabbit\@rmq1/cluster_nodes.config
{['rabbit@rmq1','rabbit@rmq2'],['rabbit@rmq1']}.
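To spot this mismatch without reading the files by eye, something like the sketch below could be run on each node. It assumes the first list in cluster_nodes.config holds every node the local node believes is in the cluster, which matches the two files above but is my reading of the format, not something I have confirmed. The function name knows_node is made up for this example.

```shell
# knows_node FILE NODE: succeed if NODE appears in the first list of FILE.
# Assumption: FILE holds one Erlang term of the form {[Nodes...],[Nodes...]}.
knows_node() {
    file=$1
    node=$2
    # Keep only the text between the first "[" and the first "]".
    all_nodes=$(tr -d '\n' < "$file" | sed -e 's/^[^[]*\[//' -e 's/\].*$//')
    case ",$all_nodes," in
        *",'$node',"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
#   knows_node /var/lib/rabbitmq/mnesia/rabbit@rmq2/cluster_nodes.config rabbit@rmq1 \
#       || echo "rmq2 has forgotten rmq1"
```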
Questions:
I have found a way to resolve question #2: to restore cluster health with no downtime, remove all Mnesia data on the inconsistent node:
[root@rmq01 ~]# rm -rf /var/lib/rabbitmq/mnesia/
[root@rmq01 ~]# service rabbitmq-server start
Starting rabbitmq-server: SUCCESS
rabbitmq-server.
[root@rmq01 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@rmq01' ...
[{nodes,[{disc,['rabbit@rmq02']},{ram,['rabbit@rmq01']}]},
{running_nodes,['rabbit@rmq02','rabbit@rmq01']},
{partitions,[]}]
...done.
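A slightly more cautious variant of the same fix is to move the Mnesia directory aside instead of deleting it, so the old state can be inspected or restored later. This is only a sketch: set_aside_mnesia is a made-up helper, it assumes the node is already stopped, and it assumes the server recreates the directory on start the same way it did after rm -rf.

```shell
# set_aside_mnesia DIR: rename DIR to DIR.bak.<timestamp> and print the new path.
# Run it only while rabbitmq-server is stopped on this node.
set_aside_mnesia() {
    dir=${1%/}    # drop a trailing slash, if any
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    backup="$dir.bak.$(date +%Y%m%d%H%M%S)"
    mv "$dir" "$backup" && echo "$backup"
}

# Example:
#   set_aside_mnesia /var/lib/rabbitmq/mnesia/ && service rabbitmq-server start
```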
I still do not understand how to avoid this scenario (question #1); maybe some Mnesia customisations would help.