zookeeper server not running

NotGaeL · Apr 1, 2016

I'm trying to start an HBase Master from Ambari.

It can't start because it can't connect to the ZooKeeper server.

Ambari marks all the ZooKeeper servers (3 nodes) as running.

The application server (Tomcat?) that runs the ZooKeeper server application seems to be running fine; at least there is a service listening on the specified port.

However, the application cannot connect to the other nodes, and it does not seem to start.

All connections are closed with the error message "ZooKeeperServer not running" in the ZooKeeper server log, and "zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x0, likely server has closed socket" on the client.

This is the ZooKeeper server log output for those nodes (the log is the same on all of them; only the node names change):

2016-03-31 16:15:34,550 - INFO  [main:QuorumPeerConfig@103] - Reading configuration from: /usr/hdp/current/zookeeper-server/conf/zoo.cfg
2016-03-31 16:15:34,553 - INFO  [main:QuorumPeerConfig@338] - Defaulting to majority quorums
2016-03-31 16:15:34,557 - INFO  [main:DatadirCleanupManager@78] - autopurge.snapRetainCount set to 30
2016-03-31 16:15:34,557 - INFO  [main:DatadirCleanupManager@79] - autopurge.purgeInterval set to 24
2016-03-31 16:15:34,558 - INFO  [PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task started.
2016-03-31 16:15:34,565 - INFO  [PurgeTask:DatadirCleanupManager$PurgeTask@144] - Purge task completed.
2016-03-31 16:15:34,566 - INFO  [main:QuorumPeerMain@127] - Starting quorum peer
2016-03-31 16:15:34,573 - INFO  [main:NIOServerCnxnFactory@94] - binding to port 0.0.0.0/0.0.0.0:2181
2016-03-31 16:15:34,582 - INFO  [main:QuorumPeer@992] - tickTime set to 2000
2016-03-31 16:15:34,582 - INFO  [main:QuorumPeer@1012] - minSessionTimeout set to -1
2016-03-31 16:15:34,582 - INFO  [main:QuorumPeer@1023] - maxSessionTimeout set to -1
2016-03-31 16:15:34,582 - INFO  [main:QuorumPeer@1038] - initLimit set to 10
2016-03-31 16:15:34,598 - INFO  [Thread-2:QuorumCnxManager$Listener@506] - My election bind port: sg1.imatiasl.lan/127.0.0.1:3888
2016-03-31 16:15:34,607 - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumPeer@747] - LOOKING
2016-03-31 16:15:34,608 - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@815] - New election. My id =  1, proposed zxid=0x0
2016-03-31 16:15:34,609 - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@597] - Notification: 1 (message format version), 1 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 1 (n.sid), 0x0 (n.peerEpoch) LOOKING (my state)
2016-03-31 16:15:34,612 - WARN  [WorkerSender[myid=1]:QuorumCnxManager@383] - Cannot open channel to 2 at election address sg2.imatiasl.lan/10.7.0.93:3888
java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:368)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:341)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:449)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:430)
        at java.lang.Thread.run(Thread.java:745)
2016-03-31 16:15:34,614 - WARN  [WorkerSender[myid=1]:QuorumCnxManager@383] - Cannot open channel to 3 at election address sg3.imatiasl.lan/10.7.0.94:3888
java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:368)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:341)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:449)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:430)
        at java.lang.Thread.run(Thread.java:745)
2016-03-31 16:15:34,812 - WARN  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@383] - Cannot open channel to 2 at election address sg2.imatiasl.lan/10.7.0.93:3888
java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:368)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:404)
        at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:840)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:795)
2016-03-31 16:15:34,813 - WARN  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@383] - Cannot open channel to 3 at election address sg3.imatiasl.lan/10.7.0.94:3888
java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:368)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:404)
        at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:840)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:795)
2016-03-31 16:15:34,813 - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 400

When the client tries to connect:

2016-03-31 16:15:35,086 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.7.0.93:55914
2016-03-31 16:15:35,130 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2016-03-31 16:15:35,130 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.7.0.93:55914 (no session established for client)

And so on...

Any ideas on how to fix this?

Answer

Alfonso Nishikawa · Apr 4, 2016

Your election port is being bound to sgX.imatiasl.lan/127.0.0.1:3888 on every node (see the "My election bind port" line in your log), so when the other nodes try to connect to, for example, sg2.imatiasl.lan/10.7.0.93:3888, the connection is refused.

The election ports should bind to 0.0.0.0:3888 or to the real IP of each node, but for some reason the hostnames are being resolved to 127.0.0.1. You can check the bound IP:port on each node with netstat -patun to confirm this.
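As a complement to netstat, here is a minimal Python sketch that checks what a node's hostname resolves to and flags loopback addresses. It assumes the election bind address follows from resolving the node's own hostname, as the "My election bind port" log line suggests; the helper names is_loopback and check_host are made up for illustration and are not part of ZooKeeper or Ambari.

```python
import ipaddress
import socket


def is_loopback(address):
    """Return True if the IP address is a loopback address (e.g. 127.0.0.0/8 or ::1)."""
    return ipaddress.ip_address(address).is_loopback


def check_host(hostname, resolver=socket.gethostbyname):
    """Resolve hostname and describe whether other nodes could reach that address.

    resolver is injectable so the check can be tried without real DNS;
    by default it uses the system resolver (which consults /etc/hosts).
    """
    address = resolver(hostname)
    if is_loopback(address):
        return f"{hostname} -> {address} (loopback: other nodes cannot reach it)"
    return f"{hostname} -> {address} (ok)"
```

On sg1, for example, calling check_host("sg1.imatiasl.lan") should report a non-loopback address once name resolution is fixed.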

Most probably you have an issue with /etc/hosts. Take a look at: https://unix.stackexchange.com/questions/240506/zookeeper-dns-name-problems-with-leader-elections-when-migrating-from-windows-to
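To illustrate the kind of /etc/hosts problem suspected here, the following Python sketch scans hosts-file text for names mapped to a loopback address. The sample content is hypothetical (your actual file was not posted), and the function name loopback_hostnames is invented for this example.

```python
import ipaddress


def loopback_hostnames(hosts_text):
    """Return {hostname: ip} for every name a hosts file maps to a loopback IP."""
    found = {}
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        try:
            if not ipaddress.ip_address(address).is_loopback:
                continue
        except ValueError:
            continue  # skip lines whose first field is not a plain IP address
        for name in names:
            found.setdefault(name, address)
    return found


# Hypothetical hosts file on sg1 showing the suspected misconfiguration:
# the node's own FQDN sits on the 127.0.0.1 line.
sample = """\
127.0.0.1   localhost sg1.imatiasl.lan sg1
10.7.0.93   sg2.imatiasl.lan sg2
10.7.0.94   sg3.imatiasl.lan sg3
"""

# Names other than localhost that point at loopback are the suspects.
suspects = {name: ip for name, ip in loopback_hostnames(sample).items()
            if not name.startswith("localhost")}
print(suspects)
```

If a node's own name shows up among the suspects, moving it to a line with the node's real IP would be the usual fix; to check a real file, pass open("/etc/hosts").read() instead of the sample.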