How to configure connection failover for a PostgreSQL Hot Standby setup in a JavaEE application?

rzo1 · Nov 30, 2016 · Viewed 9.1k times

I have two Linux servers (A and B) with a PostgreSQL 9.5 database installed. I configured hot standby mode as described in the documentation. In this setup, A is configured as master, B in hot standby mode. This is working well and behaves as expected.

Now, I want to connect an independent Java EE application (running on a different machine) via Hibernate / JDBC via a TomEE datasource to this database setup.

The PostgreSQL JDBC driver documentation states that multiple hosts can be specified in the JDBC connection URL:

jdbc:postgresql://host1:port1,host2:port2/database

So my questions are:

  1. If A is down and B is switched manually to normal operation mode, will my application still be able to proceed with database operations using a JDBC connection URL as stated above?
  2. Do I have to configure other parameters / libraries?

Note: From various sources I learned that PostgreSQL does not support automatic failover (unless third-party software is involved in the process - see comments below). For this reason, failover needs to be performed manually, which is OK for this particular use-case.

EDIT-1:

I decided to test pgBouncer (as suggested in the comments) as a workaround. It works well for my use-case. I wrote a watchdog script which automates the manual steps:

  1. Continuously check whether A is still alive and listening for incoming connections.
  2. In case of failover, switch B to normal operation mode so that it becomes the new master, and restart the service.
  3. Change the pgBouncer settings to point to B instead of A and restart the service.
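The steps above could be sketched as a small Java watchdog. This is only an illustration: the host names, data directory, pgBouncer config path and the `ssh`/`systemctl` commands are all placeholders that must be adapted to the actual setup.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class FailoverWatchdog {

    // Returns true if a TCP connection to host:port succeeds within timeoutMs.
    static boolean isReachable(String host, int port, int timeoutMs) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        final String masterHost = "hostA"; // placeholder for server A
        final int pgPort = 5432;

        // 1. Poll until server A stops accepting connections.
        while (isReachable(masterHost, pgPort, 2000)) {
            Thread.sleep(5000);
        }

        // 2. Promote the standby (placeholder command and data directory).
        run("ssh", "hostB", "pg_ctl", "promote", "-D", "/var/lib/postgresql/9.5/main");

        // 3. Repoint pgBouncer at B and restart it (placeholder paths/commands).
        run("ssh", "hostB", "sed", "-i", "s/hostA/hostB/", "/etc/pgbouncer/pgbouncer.ini");
        run("ssh", "hostB", "systemctl", "restart", "pgbouncer");
    }

    static void run(String... cmd) throws IOException, InterruptedException {
        new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```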

However, I would still be interested to hear whether anybody has experience doing this without third-party software.

Answer

vanOekel · Dec 3, 2016

In these kinds of situations, it is probably best to test and measure.

I do not have 'hands-on' experience with the PostgreSQL hot standby mode, but I have done database fail-over for a Java application.

First, test the claims on the PostgreSQL driver documentation page about the ?targetServerType=master parameter (mentioned at the bottom of the page).
Write a small Java "PgHsm" class with a main-method that uses the PostgreSQL JDBC driver via DriverManager.getConnection and runs a simple update query.
It should use server A to do the update query. Stop PostgreSQL on server A, run PgHsm: it should fail to connect since server B is not a master.
Make server B the master, run PgHsm: it should run OK.
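A minimal sketch of such a "PgHsm" class follows. The host names, database, credentials and the heartbeat table are placeholders; with targetServerType=master the driver only connects to the listed host that currently accepts writes (newer driver versions also accept targetServerType=primary).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class PgHsm {

    // Both hosts are listed; targetServerType=master tells the driver to
    // connect only to the host that is currently the master.
    // Hosts, database name and credentials are placeholders.
    static final String URL =
        "jdbc:postgresql://hostA:5432,hostB:5432/testdb?targetServerType=master";

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(URL, "test", "secret");
             PreparedStatement ps = con.prepareStatement(
                 "UPDATE heartbeat SET ts = now() WHERE id = ?")) {
            ps.setInt(1, 1);
            System.out.println("updated rows: " + ps.executeUpdate());
        }
    }
}
```

Run it against a placeholder table (e.g. `CREATE TABLE heartbeat (id int PRIMARY KEY, ts timestamptz)`) while stopping A and promoting B, as described above.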

The datasource is backed by a database connection pool in TomEE. This page lists the ones available in TomEE. But not all database connection pools are equal, and I now prefer HikariCP because, in my experience, it handles the "database down" scenario more predictably. See also the test results on HikariCP's "handling database down" page.

Unfortunately, HikariCP uses JDBC's get/setNetworkTimeout to behave predictably, and the PostgreSQL JDBC driver does not implement this (*). So to be sure that (JavaEE) application threads do not hang forever on a database action, you need to set the connectTimeout and socketTimeout JDBC driver options. Setting a socketTimeout is precarious, as it automatically sets a time limit for ALL queries to the database.
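One way to set these driver options for a TomEE datasource is in the JDBC URL in tomee.xml. This is only a sketch with placeholder hosts and credentials; for the PostgreSQL driver, connectTimeout and socketTimeout are given in seconds, so the values below allow 5 seconds to establish a connection and cap every query at 30 seconds.

```xml
<tomee>
  <Resource id="PgDataSource" type="DataSource">
    JdbcDriver org.postgresql.Driver
    JdbcUrl    jdbc:postgresql://hostA:5432,hostB:5432/testdb?targetServerType=master&amp;connectTimeout=5&amp;socketTimeout=30
    UserName   test
    Password   secret
  </Resource>
</tomee>
```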

(*) Update: since version 42.2.x network timeouts are implemented.

The second test is to update the Java "PgHsm" class to use the database connection pool implementation of your choosing and to start (at least) two threads that continually run simple update queries in a loop (in each iteration a database connection is acquired from the pool and returned to the pool after commit/rollback). While you bring down server A and switch server B to "master" mode, monitor the exceptions logged by "PgHsm" and how long a thread waits or hangs on a database action.
The results from the tests can be used to update the JDBC driver options and pool settings. Focus on results where:

  • invalid connections are removed from the pool as soon as possible so that the application gets mostly valid connections from the pool
  • as few as possible application threads hang (for the shortest amount of time) when a database goes down
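The threaded version of the test could look like the sketch below. The worker loop is written against a plain `Supplier<Connection>` so it is independent of the pool implementation; the heartbeat table and the pool wiring in main are placeholders to be replaced with your actual pool (e.g. a HikariDataSource).

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.function.Supplier;

public class PgHsmPoolTest {

    // Runs `iterations` update queries, acquiring a fresh connection from the
    // pool each iteration; returns how many iterations failed. The elapsed
    // time logged on failure shows how long a thread hung during fail-over.
    static int runLoop(Supplier<Connection> pool, int iterations) {
        int failures = 0;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            try (Connection con = pool.get();
                 PreparedStatement ps = con.prepareStatement(
                     "UPDATE heartbeat SET ts = now() WHERE id = 1")) {
                ps.executeUpdate();
                if (!con.getAutoCommit()) {
                    con.commit();
                }
            } catch (Exception e) {
                failures++;
                long waitedMs = (System.nanoTime() - start) / 1_000_000;
                System.err.println(Thread.currentThread().getName()
                        + " failed after " + waitedMs + " ms: " + e);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Placeholder: wire this to your pool's DataSource, e.g.
        // () -> { try { return ds.getConnection(); } catch (Exception e) { throw new RuntimeException(e); } }
        Supplier<Connection> pool = () -> {
            throw new IllegalStateException("wire me to a connection pool");
        };
        for (int t = 0; t < 2; t++) {
            new Thread(() -> runLoop(pool, 1000), "worker-" + t).start();
        }
    }
}
```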

The second test relies on server A not being available, so that the connection test queries (performed by the database connection pool) fail. In the case where both servers remain available but master and slave switch roles, a connection test query will not help, and the database connection pool will hand the wrong (now read-only) database connections to the application. In that case, manual intervention is required. A "fail-over pattern" for HikariCP is described here (it is only available with the allowPoolSuspension option described on the configuration page):

  • suspendPool()
  • softEvictConnections()
  • Wait until activeConnections goes to 0.
  • resumePool()

The third test will be with the JavaEE application, and by now you should have a good idea what problems to expect. It is not uncommon for applications to get updated after these kinds of tests to improve handling of "database down" scenarios (e.g. setting (default) query timeouts). In your case, a "suspend, flush and resume database connection pool" feature (the pattern described above) to use during the manual failover would also be desirable.