Why can't my Zookeeper server rejoin the Quorum?

fpearsall picture fpearsall · Mar 3, 2014 · Viewed 7.8k times · Source

I have three servers in my quorum. They are running ZooKeeper 3.4.5. Two of them appear to be running fine based on the output from mntr. One of them was restarted a couple days ago due to a deploy, and since then has not been able to join the quorum. Some lines in the logs that stick out are:

2014-03-03 18:44:40,995 [myid:1] - INFO  [main:QuorumPeer@429] - currentEpoch not found! Creating with a reasonable default of 0. This should only happen when you are upgrading your installation

and:

2014-03-03 18:44:41,233 [myid:1] - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:QuorumCnxManager@190] - Have smaller server identifier, so dropping the connection: (2, 1)
2014-03-03 18:44:41,234 [myid:1] - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:QuorumCnxManager@190] - Have smaller server identifier, so dropping the connection: (3, 1)
2014-03-03 18:44:41,235 [myid:1] - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:FastLeaderElection@774] - Notification time out: 400

Googling for the first ('currentEpoch not found!') led me to JIRA ZOOKEEPER-1653 - zookeeper fails to start because of inconsistent epoch. It describes a bug fix but doesn't describe a way to resolve the issue without upgrading zookeeper.

Googling for the second ('Have smaller server identifier, so dropping the connection') led me to JIRA ZOOKEEPER-1506 - Re-try DNS hostname -> IP resolution if node connection fails. This makes sense because I am using AWS Elastic IPs for the servers. The fix for this issue seems to be to do a rolling restart, which would cause us to temporarily lose quorum.

It looks like the second issue is definitely in play because I see timeouts in the other ZooKeeper server's logs (the ones still in the quorum) when trying to connect to the first server. What I'm not sure of is if the first issue will disappear when I do a rolling restart. I would like to avoid upgrading and/or doing a rolling restart, but if I have to do a rolling restart I'd like to avoid doing it multiple times. Is there a way to fix the first issue without upgrading? Or even better: Is there a way to resolve both issues without doing a rolling restart?

Thanks for reading and for your help!

Answer

robbie picture robbie · Dec 23, 2019

This is a bug of zookeeper: Server is unable to join quorum after connection broken to other peers Restart the leader solves this issue.