Rabbit mq - Error while waiting for Mnesia tables

jeril picture jeril · Feb 26, 2020 · Viewed 9k times · Source

I have installed rabbitmq using helm chart on a kubernetes cluster. The rabbitmq pod keeps restarting. On inspecting the pod logs I get the below error

2020-02-26 04:42:31.582 [warning] <0.314.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-02-26 04:42:31.582 [info] <0.314.0> Waiting for Mnesia tables for 30000 ms, 6 retries left

When I try to do kubectl describe pod I get this error

Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-rabbitmq-0
    ReadOnly:   false
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      rabbitmq-config
    Optional:  false
  healthchecks:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      rabbitmq-healthchecks
    Optional:  false
  rabbitmq-token-w74kb:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  rabbitmq-token-w74kb
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/arch=amd64
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                      From                                               Message
  ----     ------     ----                     ----                                               -------
  Warning  Unhealthy  3m27s (x878 over 7h21m)  kubelet, gke-analytics-default-pool-918f5943-w0t0  Readiness probe failed: Timeout: 70 seconds ...
Checking health of node [email protected] ...
Status of node [email protected] ...
Error:
{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :"$1", :_, :_}, [], [:"$1"]}]]}}
Error:
{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :"$1", :_, :_}, [], [:"$1"]}]]}}

I have provisioned the above on Google Cloud on a kubernetes cluster. I am not sure during what specific situation it started failing. I had to restart the pod and since then it has been failing.

What is the issue here ?

Answer

Ulli picture Ulli · Mar 10, 2021

TLDR

helm upgrade rabbitmq --set clustering.forceBoot=true

Problem

The problem happens for the following reason:

  • All RMQ pods are terminated at the same time due to some reason (maybe because you explicitly set the StatefulSet replicas to 0, or something else)
  • One of them is the last one to stop (maybe just a tiny bit after the others). It stores this condition ("I'm standalone now") in its filesystem, which in k8s is the PersistentVolume(Claim). Let's say this pod is rabbitmq-1.
  • When you spin the StatefulSet back up, the pod rabbitmq-0 is always the first to start (see here).
  • During startup, pod rabbitmq-0 first checks whether it's supposed to run standalone. But as far as it can see on its own filesystem, it's part of a cluster. So it checks for its peers and doesn't find any. This results in a startup failure by default.
  • rabbitmq-0 thus never becomes ready.
  • rabbitmq-1 is never starting because that's how StatefulSets are deployed - one after another. If it were to start, it would start successfully because it sees that it can run standalone as well.

So in the end, it's a bit of a mismatch between how RabbitMQ and StatefulSets work. RMQ says: "if everything goes down, just start everything and the same time, one will be able to start and as soon as this one is up, the others can rejoin the cluster." k8s StatefulSets say: "starting everything all at once is not possible, we'll start with the 0".

Solution

To fix this, there is a force_boot command for rabbitmqctl which basically tells an instance to start standalone if it doesn't find any peers. How you can use this from Kubernetes depends on the Helm chart and container you're using. In the Bitnami Chart, which uses the Bitnami Docker image, there is a value clustering.forceBoot = true, which translates to an env variable RABBITMQ_FORCE_BOOT = yes in the container, which will then issue the above command for you.

But looking at the problem, you can also see why deleting PVCs will work (other answer). The pods will just all "forget" that they were part of a RMQ cluster the last time around, and happily start. I would prefer the above solution though, as no data is being lost.