My ZooKeeper instance manages several queues for different jobs, holding the relevant job data in each node until the computer is ready to process it. If I stop the overall service so that no jobs can be started, ZooKeeper runs just fine after a restart. However, some of these jobs seem to cause ZooKeeper to crash with the following message in the ZooKeeper log:
WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x15677f740ad002a, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:745)
INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for client /127.0.0.1:46998 which had sessionid 0x15677f740ad002a
My ZooKeeper knowledge is very limited, as I am taking over from the person who set it up originally.
I have tried deleting a lot of nodes with rmr [path] in the ZooKeeper shell, which seemed to have some effect (I deleted 50k+ nodes that were left over and of no use), but it has kept crashing daily, and last night I couldn't get it to run for more than a couple of minutes before the same error/crash occurred.
How do I find out what is causing this?
I am pretty sure it is some general problem with the data that is received, or with the stored data/nodes. The disk is only 92% full. I also found this post: Zookeeper keeps getting the WARN: "caught end of stream exception", but the solution doesn't make much sense to me. I am also pretty sure that none of the messages kept in my znodes are larger than 1 MB, but I am unsure how to confirm this.
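One way the znode sizes could be checked is to walk the tree from a client and record the payload size of every node. The sketch below is only an illustration under some assumptions: it uses the third-party kazoo Python client (not something from my setup), the helper names find_large_znodes and check_live_server are made up for this example, and the server address is simply the one that appears in the log above. The 1 MB threshold corresponds to ZooKeeper's default jute.maxbuffer limit.

```python
def find_large_znodes(client, root="/", threshold=1024 * 1024):
    """Walk the tree under `root` and return [(path, size_in_bytes)] for
    every znode whose data is at least `threshold` bytes, largest first.

    `client` is any object with kazoo-style get(path) -> (data, stat)
    and get_children(path) -> list of child names.
    """
    large = []
    stack = [root]
    while stack:
        path = stack.pop()
        data, _stat = client.get(path)
        size = len(data) if data else 0  # treat missing data as 0 bytes
        if size >= threshold:
            large.append((path, size))
        for child in client.get_children(path):
            # rstrip keeps the root "/" from producing "//child"
            stack.append(path.rstrip("/") + "/" + child)
    return sorted(large, key=lambda item: item[1], reverse=True)


def check_live_server(hosts="127.0.0.1:2181", threshold=512 * 1024):
    """Hypothetical entry point: print znodes above `threshold` on a live server."""
    from kazoo.client import KazooClient  # third-party: pip install kazoo

    zk = KazooClient(hosts=hosts)
    zk.start()
    try:
        for path, size in find_large_znodes(zk, "/", threshold):
            print(size, path)
    finally:
        zk.stop()
```

Calling check_live_server() while the job service is stopped would list the largest znodes first. Nodes that are deleted while the walk is running would raise an error in this sketch, so it is best run against a quiet server. It also only measures data size; a znode with a very large number of children may produce an oversized response as well, which this does not detect.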
Is there some way I can change the ZooKeeper log so that I can print additional information, such as the content/name of the znode it is operating on before it crashes?
I was able to solve the problem by deleting all ZooKeeper snapshots and log files from the server running ZooKeeper. I don't know why this made a difference, but it has been running fine for the last 22 hours.