Websocket transport reliability (Socket.io data loss during reconnection)

igorpavlov · Dec 19, 2013 · Viewed 29.6k times

Used

NodeJS, Socket.io

Problem

Imagine there are two users, U1 and U2, connected to an app via Socket.io. The sequence of events is the following:

  1. U1 completely loses the Internet connection (e.g. switches the Internet off).
  2. U2 sends a message to U1.
  3. U1 does not receive the message yet, because the Internet is down.
  4. The server detects U1's disconnection by heartbeat timeout.
  5. U1 reconnects to Socket.io.
  6. U1 never receives the message from U2; I guess it is lost at Step 4.

Possible explanation

I think I understand why it happens:

  • At Step 4 the server kills the socket instance, and the queue of messages to U1 along with it.
  • Moreover, at Step 5 U1 and the server create a new connection (the old one is not reused), so even if the message is still queued, the previous connection is lost anyway.

Need help

How can I prevent this kind of data loss? I have to use heartbeats, because I do not want people to hang in the app forever. I must also still allow reconnection, because when I deploy a new version of the app I want zero downtime.

P.S. The thing I call a "message" is not just a text message I can store in a database, but a valuable system message whose delivery must be guaranteed, or the UI screws up.

Thanks!


Addition 1

I do already have a user account system. Moreover, my application is already complex. Adding offline/online statuses won't help, because I already have this kind of stuff. The problem is different.

Check out Step 2. At this step we technically cannot say whether U1 has gone offline; he may just have lost the connection for, let's say, 2 seconds, probably because of a bad Internet connection. So U2 sends him a message, but U1 doesn't receive it because the Internet is still down for him (Step 3). Step 4 is needed to detect offline users; let's say the timeout is 60 seconds. Eventually, in another 10 seconds, the Internet connection for U1 is back up and he reconnects to Socket.io. But the message from U2 is lost in space, because on the server U1 was disconnected by the timeout.

That is the problem: I want 100% delivery.


Solution

  1. Store each emit (its name and data) in a per-user object {}, keyed by a random emitID, then send the emit.
  2. Confirm the emit on the client side (send an emit back to the server with the emitID).
  3. If confirmed, delete the object identified by emitID from {}.
  4. If the user reconnects, check {} for this user and loop through it, executing Step 1 for each object in {}.
  5. On disconnect and/or connect, flush {} for the user if necessary.

// Server
const pendingEmits = {}; // emitID -> { name, data }, kept per user

// 'reconnection' is a custom event; resendAllPendingEmits is a helper you write yourself (Step 4)
socket.on('reconnection', () => resendAllPendingEmits(socket));
socket.on('confirm', (emitID) => { delete pendingEmits[emitID]; });

// Client
socket.on('something', ({ emitID, data }) => {
    // handle `data`, then confirm so the server can drop the pending emit
    socket.emit('confirm', emitID);
});
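
A slightly fuller sketch of the five steps above, kept independent of Socket.io so the bookkeeping is easier to see. The `PendingEmits` class, its method names, and the `send` callback are hypothetical and not part of any library; in a real app `send` would presumably wrap `socket.emit`, and `confirm` would be called from the `'confirm'` handler.

```javascript
const crypto = require('crypto');

// Hypothetical helper: tracks unconfirmed emits per user (the "{}" from the steps above).
class PendingEmits {
  constructor() {
    this.byUser = new Map(); // userId -> Map(emitID -> { name, data })
  }

  // Step 1: remember the emit under a random emitID, then send it.
  emit(userId, name, data, send) {
    const emitID = crypto.randomBytes(8).toString('hex');
    if (!this.byUser.has(userId)) this.byUser.set(userId, new Map());
    this.byUser.get(userId).set(emitID, { name, data });
    send(name, { emitID, data });
    return emitID;
  }

  // Steps 2-3: the client confirmed, so forget the emit.
  confirm(userId, emitID) {
    const pending = this.byUser.get(userId);
    if (pending) pending.delete(emitID);
  }

  // Step 4: after a reconnect, resend everything still unconfirmed.
  resendAll(userId, send) {
    const pending = this.byUser.get(userId) || new Map();
    for (const [emitID, { name, data }] of pending) {
      send(name, { emitID, data });
    }
  }

  // Step 5: drop a user's queue when it is no longer needed.
  flush(userId) {
    this.byUser.delete(userId);
  }
}
```

Because an unconfirmed emit is resent after every reconnect, the client may see the same emitID more than once, so the handler on the client side should tolerate duplicates.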

Solution 2 (kinda)

Added 1 Feb 2020.

While this is not really a solution for Websockets, someone may still find it handy. We migrated from Websockets to SSE + Ajax. SSE lets a client keep a persistent HTTP connection open and receive messages from the server in real time. To send messages from the client to the server, simply use Ajax. There are disadvantages such as latency and overhead, but SSE has been reliable for us: it is a plain TCP connection, and the browser's EventSource reconnects on its own and, when events carry IDs, sends a Last-Event-ID header so the server can tell what the client last received.

Since we use Express, we use this library for SSE: https://github.com/dpskvn/express-sse, but you can choose whichever fits you.

SSE is not supported in IE or in legacy (pre-Chromium) Edge, so you would need a polyfill there: https://github.com/Yaffle/EventSource.
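
For illustration, here is a minimal sketch of how an SSE endpoint can replay missed events, using only Node's built-in `http` module rather than express-sse. It relies on the standard `Last-Event-ID` request header that `EventSource` sends when it reconnects; the in-memory `eventLog`, the `publish` and `eventsAfter` helpers, and the `/events` path are assumptions made for the example.

```javascript
const http = require('http');

const eventLog = [];       // hypothetical in-memory log: [{ id, data }]
const clients = new Set(); // currently open SSE responses

// Serialize one event in the text/event-stream wire format.
function formatEvent(event) {
  return `id: ${event.id}\ndata: ${JSON.stringify(event.data)}\n\n`;
}

// Events the client has not seen yet, given the last ID it reported.
function eventsAfter(lastEventId) {
  const last = Number(lastEventId) || 0;
  return eventLog.filter((event) => event.id > last);
}

// Record an event and push it to every connected client.
function publish(data) {
  const event = { id: eventLog.length + 1, data };
  eventLog.push(event);
  for (const res of clients) res.write(formatEvent(event));
  return event;
}

const server = http.createServer((req, res) => {
  if (req.url !== '/events') {
    res.writeHead(404);
    res.end();
    return;
  }
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    Connection: 'keep-alive',
  });
  // On reconnect, EventSource reports the last id it saw; replay the gap first.
  for (const event of eventsAfter(req.headers['last-event-id'])) {
    res.write(formatEvent(event));
  }
  clients.add(res);
  req.on('close', () => clients.delete(res));
});

// server.listen(3000); // uncomment to serve; the browser side is `new EventSource('/events')`
```

The log here grows without bound; a real deployment would need to trim or persist it.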

Answer

Michelle Tilley · Dec 28, 2013

Others have hinted at this in other answers and comments, but the root problem is that Socket.IO is just a delivery mechanism, and you cannot depend on it alone for reliable delivery. The only person who knows for sure that a message has been successfully delivered to the client is the client itself. For this kind of system, I would recommend making the following assertions:

  1. Messages aren't sent directly to clients; instead, they get sent to the server and stored in some kind of data store.
  2. Clients are responsible for asking "what did I miss" when they reconnect, and will query the stored messages in the data store to update their state.
  3. If a message is sent to the server while the recipient client is connected, that message will be sent in real time to the client.

Of course, depending on your application's needs, you can tune pieces of this--for example, you can use, say, a Redis list or sorted set for the messages, and clear them out if you know for a fact a client is up to date.
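
To make the three assertions concrete, here is a rough sketch that uses an in-memory object as a stand-in for the data store. `MessageStore`, `deliver`, and their method names are hypothetical; a Redis list or a database table would take the place of the `Map` in practice, and `socket` is assumed to be the recipient's Socket.IO socket if one is currently connected.

```javascript
// Hypothetical in-memory stand-in for the data store.
class MessageStore {
  constructor() {
    this.nextId = 1;
    this.byRecipient = new Map(); // recipientId -> [{ id, payload }], ordered by id
  }

  // Assertion 1: every message is stored first and gets a sequential ID.
  save(recipientId, payload) {
    const message = { id: this.nextId++, payload };
    if (!this.byRecipient.has(recipientId)) this.byRecipient.set(recipientId, []);
    this.byRecipient.get(recipientId).push(message);
    return message;
  }

  // Assertion 2: answer "what did I miss?" for a reconnecting client.
  missedSince(recipientId, lastSeenId) {
    const messages = this.byRecipient.get(recipientId) || [];
    return messages.filter((message) => message.id > lastSeenId);
  }

  // Optional cleanup once the client is known to be up to date.
  acknowledge(recipientId, upToId) {
    const messages = this.byRecipient.get(recipientId) || [];
    this.byRecipient.set(recipientId, messages.filter((message) => message.id > upToId));
  }
}

// Assertion 3: store first, then attempt real-time delivery if a socket is available.
function deliver(store, recipientId, payload, socket) {
  const message = store.save(recipientId, payload);
  if (socket) socket.emit('message', message);
  return message;
}
```

Storing before sending is the important ordering: if the emit never arrives, the message is still there to be picked up on reconnect.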


Here are a couple of examples:

Happy path:

  • U1 and U2 are both connected to the system.
  • U2 sends a message to the server that U1 should receive.
  • The server stores the message in some kind of persistent store, marking it for U1 with some kind of timestamp or sequential ID.
  • The server sends the message to U1 via Socket.IO.
  • U1's client confirms (perhaps via a Socket.IO callback) that it received the message.
  • The server deletes the persisted message from the data store.

Offline path:

  • U1 loses internet connectivity.
  • U2 sends a message to the server that U1 should receive.
  • The server stores the message in some kind of persistent store, marking it for U1 with some kind of timestamp or sequential ID.
  • The server sends the message to U1 via Socket.IO.
  • U1's client does not confirm receipt, because they are offline.
  • Perhaps U2 sends U1 a few more messages; they all get stored in the data store in the same fashion.
  • When U1 reconnects, it asks the server "The last message I saw was X / I have state X, what did I miss?"
  • The server sends U1 all the messages it missed from the data store, based on U1's request.
  • U1's client confirms receipt and the server removes those messages from the data store.
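
One consequence of the offline path is that a message can reach the client twice, for example once in real time and again in the catch-up batch if the first confirmation was lost. A minimal client-side sketch, assuming each message carries the sequential `id` mentioned above and that messages arrive in ID order; `createReceiver`, `handleMessage`, and `catchUpRequest` are hypothetical names, not part of Socket.IO.

```javascript
// Hypothetical client-side bookkeeping for "the last message I saw was X".
function createReceiver(applyMessage) {
  let lastSeenId = 0;

  return {
    // Call for every incoming message, live or replayed.
    // Returns true if the message was new and has been applied.
    handleMessage(message) {
      if (message.id <= lastSeenId) return false; // duplicate: already applied
      applyMessage(message);
      lastSeenId = message.id;
      return true;
    },
    // What to send to the server after reconnecting.
    catchUpRequest() {
      return { lastSeenId };
    },
  };
}
```

In a Socket.IO client, `handleMessage` would be called from the `'message'` handler before invoking the acknowledgement callback, and `catchUpRequest()` would supply the payload for the "what did I miss" request.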

If you absolutely want guaranteed delivery, then it's important to design your system in such a way that being connected doesn't actually matter, and that realtime delivery is simply a bonus; this almost always involves a data store of some kind. As user568109 mentioned in a comment, there are messaging systems that abstract away the storage and delivery of said messages, and it may be worth looking into such a prebuilt solution. (You will likely still have to write the Socket.IO integration yourself.)

If you're not interested in storing the messages in the database, you may be able to get away with storing them in a local array; the server tries to send U1 the message, and stores it in a list of "pending messages" until U1's client confirms that it received it. If the client is offline, then when it comes back it can tell the server "Hey I was disconnected, please send me anything I missed" and the server can iterate through those messages.

Luckily, Socket.IO provides a mechanism that allows a client to "respond" to a message that looks like native JS callbacks. Here is some pseudocode:

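// server: `socket` is the connected client's socket
// NOTE: a reconnect creates a new socket object, so in a real app keep this
// list per user (e.g. in a map keyed by user ID), not per socket.
var pendingMessagesForSocket = [];

// drop a message from the pending list once the client has confirmed it
function confirmMessage(message) {
  var index = pendingMessagesForSocket.indexOf(message);
  if (index !== -1) pendingMessagesForSocket.splice(index, 1);
}

function sendMessage(message) {
  pendingMessagesForSocket.push(message);
  socket.emit('message', message, function() {
    confirmMessage(message);
  });
}

// 'reconnection' is a custom event emitted by the client below
socket.on('reconnection', function(lastKnownMessage) {
  // you may want to make sure you resend them in order, or one at a time, etc.
  // assumes every message carries a sequential `id`
  var lastId = lastKnownMessage ? lastKnownMessage.id : -Infinity;
  pendingMessagesForSocket
    .filter(function(message) { return message.id > lastId; })
    .forEach(function(message) {
      socket.emit('message', message, function() {
        confirmMessage(message);
      });
    });
});

// client
var previouslyConnected = false;
var lastKnownMessage = null;

// the client-side event is 'connect'; 'connection' only exists on the server
socket.on('connect', function() {
  if (previouslyConnected) {
    socket.emit('reconnection', lastKnownMessage);
  } else {
    // first connection; any further connections means we disconnected
    previouslyConnected = true;
  }
});

socket.on('message', function(data, callback) {
  // Do something with `data`
  lastKnownMessage = data;
  callback(); // confirm we received the message
});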

This is quite similar to the last suggestion, simply without a persistent data store.


You may also be interested in the concept of event sourcing.
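
In event sourcing, the ordered log of events is the source of truth and the current state is derived by replaying it, which is why a client that missed some events can catch up by replaying from its last known position. A tiny sketch of the idea; the event types and the `applyEvent` and `replay` functions are invented for illustration.

```javascript
// Hypothetical events for a chat room; state is derived from events, never edited directly.
function applyEvent(state, event) {
  switch (event.type) {
    case 'userJoined':
      return { ...state, users: [...state.users, event.user] };
    case 'userLeft':
      return { ...state, users: state.users.filter((user) => user !== event.user) };
    default:
      return state; // unknown events are ignored
  }
}

// Rebuild state by folding events over a starting state.
function replay(events, initialState = { users: [] }) {
  return events.reduce(applyEvent, initialState);
}
```

Replaying only the events a client missed, starting from the state it already has, gives the same result as replaying the whole log from scratch.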