Synchronizing table data across databases

java sql spring-batch

Serkan Arıkuşu · Mar 5, 2013 · Viewed 12.7k times · Source

I have one table that records its row insert/update timestamps on a field.

I want to synchronize data in this table with another table on another db server. Two db servers are not connected and synchronization is one way (master/slave). Using table triggers is not suitable

My workflow:

I use a global last_sync_date parameter and query table Master for the changed/inserted records
Output the resulting rows to xml
Parse the xml and update table Slave using updates and inserts

The complexity of the problem rises when dealing with deleted records of Master table. To catch the deleted records I think I have to maintain a log table for the previously inserted records and use sql "NOT IN". This becomes a performance problem when dealing with large datasets.

What would be an alternative workflow dealing with this scenario?

Answer

It sounds like you need a transactional message queue.

How this works is simple. When you update the master db you can send a message to the message broker (of whatever the update was) which can go to any number of queues. Each slave db can have its own queue and because queue's preserve order the process should eventually synchronize correctly (ironically this is sort of how most RDBMS do replication internally).

Think of the Message Queue as a sort of SCM change-list or patch-list database. That is for the most part the same (or roughly the same) SQL statements sent to master should be replicated to the other databases eventually. Don't worry about loosing messages as most message queues support durability and transactions.

I recommend you look at spring-amqp and/or spring-integration especially since you tagged this question with spring-batch.

Based on your comments:

See Spring Integration: http://static.springsource.org/spring-integration/reference/htmlsingle/ .
Google SEDA. Whether you go this route or not you should know about Message queues as it goes hand-in-hand with batch processing.
RabbitMQ has a good picture diagram of how messaging works
The contents of your message might be the entire row and whether its a CRUD, UPDATE, DELETE. You can use whatever format (e.g. JSON. See spring integration on recommendations).
- You could even send the direct SQL statements as a message!

BTW your concern of NOT IN being a performance problem is not a very good one as there are a plethora of work-arounds but given your not wanting to do DB specific things (like triggers and replication) I still feel a message queue is your best option.

EDIT - Non MQ route

Since I gave you a tough time about asking this quesiton I will continue to try to help. Besides the message queue you can do some sort of XML file like you we were trying before. THE CRITICAL FEATURE you need in the schema is a CREATE TIMESTAMP column on your master database so that you can do the batch processing while the system is up and running (otherwise you will have to stop the system). Now if you go this route you will want to SELECT * WHERE CREATE_TIME < ? is less than the current time. Basically your only getting the rows at a snapshot.

Now on your other database for the delete your going to remove rows by inner joining on a ID table but with != (that is you can use JOINS instead of slow NOT IN). Luckily you only need all the ids for delete and not the other columns. The other columns you can use a delta based on the the update time stamp column (for update, and create aka insert).

Synchronizing table data across databases

Answer

Related questions