Tibco-Ems Failover Issue

DanielG · Sep 26, 2014 · Viewed 8.1k times

I have two Tibco-Ems servers running in a fault-tolerant setup. If the active server becomes unavailable, the standby server takes over as expected.

However, every now and then I get a strange error. The new active server then logs: "reconnect failed: connection unknown for id= XY"

This only happens if my client has an open connection. I would expect that connection to switch to the new active server as well. As I said, sometimes it does and sometimes it does not.

When I register for EMS exceptions in my client, I get the error: "Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host."

Stacktrace:
   at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
   at TIBCO.EMS.LinkTcp._readEx(Byte[] buffer, Int32 offset, Int32 size)
   at TIBCO.EMS.LinkTcp._ReadWireMsg()
   at TIBCO.EMS.LinkTcp.LinkReader.Work()

I have run out of ideas. Maybe somebody can help me understand what exactly the problem is. Thanks in advance.

UPDATE: A late update here: even though I get the "reconnect failed" error, failover works as expected and the second server takes over.

Answer

nochum · Oct 29, 2014

Here's what's going on. An EMS server keeps track of its active client connections and stores information about them in the meta.db store file. After a fault-tolerant failover, the new primary EMS instance can recover the client connections when the clients reconnect, by matching the information each client provides against the information stored in the meta.db store file.

At some point EMS cleans up client connections that have not reconnected. That point is governed by the ft_reconnect_timeout parameter in the tibemsd.conf configuration file, which defaults to 60 seconds. Depending on your logging settings, when EMS cleans up "expired" connections you may see a message in your EMS logs indicating that it has "purged" a client connection.
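As a rough sketch, giving slow clients more time to come back would mean raising that parameter in tibemsd.conf. The value 120 below is an arbitrary illustration, not a recommendation, and the right value depends on how long your clients realistically need to reach the new active server:

```
# tibemsd.conf (fragment)
# Seconds the newly active server waits for known clients to reconnect
# before purging their connection state. Default is 60.
# 120 is an assumed example value.
ft_reconnect_timeout = 120
```

A longer timeout means the server holds state for missing clients longer, so it is a trade-off rather than a free fix.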

Sometimes a client attempts to reconnect only after the EMS server has already purged its "expired" connection. This can happen when a network partition prevents the client from reaching the EMS server until after the cleanup has run. When this happens you will see the "Reconnect failed: connection unknown..." message.

When a client is unable to reconnect because of this error, it simply opens a new connection instead. This works, and the client is able to continue processing.
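To make the sequence concrete, here is a small toy model in Python. It is not EMS code and does not use any TIBCO API. The class ToyServer, the function client_recover, and the connection ids are all made up for illustration, and the purge logic is simplified to "forget everything once the timeout has elapsed since failover":

```python
from dataclasses import dataclass, field


@dataclass
class ToyServer:
    """Toy stand-in for the newly active server after failover (not EMS code)."""
    reconnect_timeout: float = 60.0   # plays the role of ft_reconnect_timeout
    failover_time: float = 0.0        # moment this server became active
    # Connection ids recovered from the store, still waiting for their client.
    pending: set = field(default_factory=set)
    active: set = field(default_factory=set)
    next_id: int = 1000

    def purge_expired(self, now: float) -> None:
        # After the timeout, forget every connection that has not come back.
        if now - self.failover_time >= self.reconnect_timeout:
            self.pending.clear()

    def reconnect(self, conn_id: int, now: float) -> bool:
        # Succeeds only if the connection id is still known to the server.
        self.purge_expired(now)
        if conn_id in self.pending:
            self.pending.remove(conn_id)
            self.active.add(conn_id)
            return True
        return False  # corresponds to "reconnect failed: connection unknown"

    def connect_new(self) -> int:
        # A brand-new connection always gets a fresh id.
        self.next_id += 1
        self.active.add(self.next_id)
        return self.next_id


def client_recover(server: ToyServer, conn_id: int, now: float):
    """Try to resume the old connection, fall back to a new one if purged."""
    if server.reconnect(conn_id, now):
        return ("reconnected", conn_id)
    return ("new connection", server.connect_new())


if __name__ == "__main__":
    server = ToyServer(reconnect_timeout=60.0, pending={7, 8})
    print(client_recover(server, 7, now=10.0))  # ('reconnected', 7)
    print(client_recover(server, 8, now=75.0))  # ('new connection', 1001)
```

Client 7 returns within the timeout and resumes its old connection. Client 8 shows up after the purge, gets the "unknown" result, and carries on with a new connection, which matches the behaviour described in the question's update.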