I deployed a long running Storm topology. After several hours running, the whole topology went down. I checked worker logs, and found these logs . As it says, zookeeper client session timed out and it caused reconnection. I suspect it was relate to my broken topology. Now I try to find out what can cause clients timeout.
2016-02-29T10:34:12.386+0800 o.a.s.z.ClientCnxn [INFO] Client session timed out, have not heard from server in 23789ms for sessionid 0x252f862028c0083, closing socket connection and attempting reconnect
2016-02-29T10:34:12.986+0800 o.a.s.c.f.s.ConnectionStateManager [INFO] State change: SUSPENDED
2016-02-29T10:34:13.059+0800 b.s.cluster [WARN] Received event :disconnected::none: with disconnected Zookeeper.
2016-02-29T10:34:13.197+0800 o.a.s.z.ClientCnxn [INFO] Opening socket connection to server zk-3.cloud.mos/172.16.13.147:2181. Will not attempt to authenticate using SASL (unknown error)
2016-02-29T10:34:13.241+0800 o.a.s.z.ClientCnxn [WARN] Session 0x252f862028c0083 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[na:1.8.0_31]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716) ~[na:1.8.0_31]
at org.apache.storm.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) ~[storm-core-0.9.6.jar:0.9.6]
at org.apache.storm.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) ~[storm-core-0.9.6.jar:0.9.6]
Your client can no longer talk to the ZooKeeper server. The first thing that happened was there was no answer to the heartbeats within the negotiated session timeout:
2016-02-29T10:34:12.386+0800 o.a.s.z.ClientCnxn [INFO] Client session timed out, have not heard from server in 23789ms for sessionid 0x252f862028c0083, closing socket connection and attempting reconnect
Then when it tried to reconnect, it got a connection refused:
2016-02-29T10:34:13.241+0800 o.a.s.z.ClientCnxn [WARN] Session 0x252f862028c0083 for server null, unexpected error, closing socket connection and attempting reconnect java.net.ConnectException: Connection refused
This means either your ZooKeeper server:
To tell more you will need to check the ZooKeeper server logs on your (Hadoop?) cluster.