Sharing data between SQL databases

Matt · May 4, 2013

I'm trying to solve a problem that, for once, I didn't create.

I work in an environment with many web apps backed by different databases on different servers.

Each database is rather unique in its design and application, but there is still common data in each that I would like to abstract out. Each database, for instance, has a vendors table, a users table, etc.

I would like to abstract this common data out into a single database but still let the other databases join on these tables, even have keys to enforce constraints, etc. I'm in an MS SQL Server environment.

What are the options available? The way I see it, I have the following options:

  • Linked servers
  • Read-only logins to give access to views (roughly sketched below)
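
For the second option, something along these lines is what I have in mind; SharedCore, AppReader, and vVendors are just made-up names:

    USE SharedCore;
    GO
    -- Expose only curated views, not the base tables
    CREATE VIEW dbo.vVendors
    AS
    SELECT VendorId, VendorName
    FROM dbo.Vendors;
    GO
    -- A read-only login/user for the other applications to use
    CREATE LOGIN AppReader WITH PASSWORD = 'ChangeMe!123';
    CREATE USER AppReader FOR LOGIN AppReader;
    GRANT SELECT ON dbo.vVendors TO AppReader;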

Is there anything else to consider?

Answer

Registered User · May 6, 2013

There are many ways you could tackle this problem. I would strongly recommend solution 1, 2, or 3, depending upon your business needs:

  1. Transactional Replication: If the common database is the record of account and you want to provide read-only versions of the data to separate applications, then you can replicate the core tables, possibly even just the core columns of those tables, to each separate server (a rough setup sketch follows at the end of this item). One upside of this approach is that you can replicate to as many subscriber databases as you want. This also means you can customize which tables and fields are available to the subscribers based on their needs. So if one application needs user tables and not vendor tables, then you only subscribe to the user tables. If another only needs vendor tables and not user tables, then you can subscribe to the vendor tables only. Another upside is that replication keeps itself in sync, and you can always reinitialize a subscription if a problem comes up.

    I have used transactional replication to push out over 100 tables from a data warehouse to separate downstream applications that needed access to aggregated data from multiple systems. Since our data warehouse was updated on an hourly schedule from mirror and log shipping data sources, the production applications had data from numerous systems within a sliding window of 20 to 80 minutes each hour.

    Peer-to-Peer transactional replication as a publication type may be better suited for the use-case you provided. This can be really useful if you want to roll out schema or replication changes node by node. Standard transactional replication has some limitations in this area.

    Snapshot replication publications have more latency than transactional publications, but you may want to consider them if a degree of latency is acceptable.

    Although you mentioned you are a Microsoft SQL Server shop, please keep in mind that other RDBMSs have similar technologies. Since you are talking about MS SQL Server specifically, note that transactional replication also allows you to replicate to Oracle databases, so if you have a few of those in your organization, this solution can still work.

    A downside to using transactional replication is that if your central server goes down, you may begin experiencing latency with data in downstream copies of the replicated objects. If the replicated objects (articles) are really big and you need to reinitialize a table, that can take a really long time, too.
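
    For the setup sketch mentioned above: the core publication calls look roughly like the following. CoreDB, CorePub, dbo.Users, the PasswordHash column, DownstreamServer1, and AppDB1 are placeholder names, the distributor is assumed to be configured already, and the snapshot agent setup is omitted for brevity.

        USE CoreDB;
        -- Enable the database for publication (distributor already configured)
        EXEC sp_replicationdboption
             @dbname = N'CoreDB', @optname = N'publish', @value = N'true';

        -- A continuous transactional publication for the shared core data
        EXEC sp_addpublication
             @publication = N'CorePub',
             @repl_freq = N'continuous',
             @status = N'active';

        -- Publish only the tables (articles) this set of subscribers needs
        EXEC sp_addarticle
             @publication = N'CorePub',
             @article = N'Users',
             @source_owner = N'dbo',
             @source_object = N'Users';

        -- Vertical filtering: drop columns the subscribers should not receive
        EXEC sp_articlecolumn
             @publication = N'CorePub',
             @article = N'Users',
             @column = N'PasswordHash',
             @operation = N'drop',
             @force_invalidate_snapshot = 1;

        -- One subscription per downstream application database
        EXEC sp_addsubscription
             @publication = N'CorePub',
             @subscriber = N'DownstreamServer1',
             @destination_db = N'AppDB1',
             @subscription_type = N'push';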

  2. Mirrors: If you want to make the database accessible in near real-time on downstream servers, you could set up an asynchronous mirror (a minimal sketch follows below). I've integrated data with a CRM application in this manner: all reads came from joins to the mirror, and all writes were pushed to a message queue, which then applied the changes to the central production server. The downside of this approach is that database mirroring only allows a single mirror per database. You don't want to use a synchronous mirror for this purpose unless you are planning to use the mirror for disaster recovery, too.
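
    Here, the server addresses, the CoreDB name, and the CoreDB_Data logical file name are placeholders; mirroring endpoints and a WITH NORECOVERY restore of the database on the mirror server are assumed to already be in place:

        -- On the mirror server: point the restored (NORECOVERY) copy at the principal
        ALTER DATABASE CoreDB SET PARTNER = 'TCP://principal-host:5022';

        -- On the principal: complete the pairing and run it asynchronously
        ALTER DATABASE CoreDB SET PARTNER = 'TCP://mirror-host:5022';
        ALTER DATABASE CoreDB SET PARTNER SAFETY OFF;   -- high-performance (asynchronous) mode

        -- The mirror itself is not readable, so reads go through a database
        -- snapshot created on the mirror server and refreshed on a schedule
        CREATE DATABASE CoreDB_ReadSnap
        ON ( NAME = CoreDB_Data, FILENAME = 'D:\Snapshots\CoreDB_ReadSnap.ss' )
        AS SNAPSHOT OF CoreDB;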

  3. Messaging Systems: If you expect to have numerous separate applications that need data from a single central database, then you may want to consider enterprise messaging systems like IBM WebSphere, Microsoft BizTalk, Vitria, TIBCO, etc. These applications are built specifically to address this problem. They tend to be expensive and cumbersome to implement and maintain, but they can scale up if you have globally distributed systems or dozens of separate applications that all need to share data to some degree.

  4. Linked Servers: It sounds like you already thought of this one. You could expose the data via linked servers. I do not believe this is a good solution. If you really want to go this route, then consider setting up an asynchronous mirror from the central database to another server and then set up linked server connections to the mirror. This will at least mitigate the risk that a query from the web applications will cause blocking or performance problems with your central production database.

    IMO, linked servers tend to be a dangerous method for sharing data between applications. This approach still treats the data as a second-class citizen in your database, and it leads to some pretty bad coding habits, particularly since your developers may be working on different servers in different languages with different connection methods. You don't know if someone is going to write a truly heinous query against your core data. If you set a standard that requires pushing a full copy of the shared data down to the non-core server, then you don't have to worry about whether or not a developer writes bad code, at least from the perspective that their poor code won't jeopardize the performance of other well-written systems.

    There are many, many resources out there that explain why using linked servers can be bad in this context. A non-exhaustive list of reasons includes: (a) the account used for the linked server must have DBCC SHOW_STATISTICS permissions or the queries will not be able to make use of existing statistics, (b) query hints can't be used unless the query is submitted as an OPENQUERY, (c) parameters can't be passed when using OPENQUERY, (d) the local server doesn't have sufficient statistics about the linked server and consequently creates pretty terrible query plans, (e) network connectivity issues can cause failures, (f) any one of these five performance issues, and (g) the dreaded SSPI context error when trying to authenticate Windows Active Directory credentials in a double-hop scenario. Linked servers can be useful for some specific scenarios, but building access to a central database around this feature, although technically possible, is not recommended. If you insist on going this route, a bare-bones example follows below.
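
    Here, SHAREDSRV, mirror-host, and SharedCore are placeholder names, and the link points at a mirror rather than the central production server, as suggested above; the snippet sketches the linked server definition plus the two query styles mentioned in (b) and (c):

        -- Define the linked server against the mirror, not the production principal
        EXEC sp_addlinkedserver
             @server = N'SHAREDSRV',
             @srvproduct = N'',
             @provider = N'SQLNCLI',
             @datasrc = N'mirror-host';

        -- Four-part names are convenient, but plan quality depends on the
        -- remote statistics and permissions issues listed above
        SELECT u.UserId, u.UserName
        FROM SHAREDSRV.SharedCore.dbo.Users AS u;

        -- OPENQUERY evaluates the whole query remotely, but the query text is
        -- a fixed string and cannot take parameters
        SELECT *
        FROM OPENQUERY(SHAREDSRV, 'SELECT UserId, UserName FROM SharedCore.dbo.Users');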

  5. Bulk ETL Process: If a high degree of latency is acceptable for the web applications, then you could write bulk ETL processes with SSIS (lots of good links in this Stack Overflow question) which are executed by SQL Server Agent jobs to move data between servers. There are also other alternative ETL tools like Informatica, Pentaho, etc., so use what works best for you.

    This is not a good solution if you need a low degree of latency. I have used this solution when syncing to a 3rd-party hosted CRM solution for fields that could tolerate high latency. For fields that could not tolerate high latency (basic account creation data), we relied on creating duplicate records in the CRM through web service calls at the point of account generation. A minimal example of the kind of statement such a scheduled job might run is sketched below.
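
    Here, staging.Users and dbo.Users are placeholder tables, and the staging load itself would come from SSIS, bcp, or whichever ETL tool you pick; the job step then applies an incremental upsert:

        -- Apply staged changes to the local copy of the shared table
        MERGE dbo.Users AS target
        USING staging.Users AS source
            ON target.UserId = source.UserId
        WHEN MATCHED AND (target.UserName <> source.UserName
                          OR target.Email <> source.Email) THEN
            UPDATE SET target.UserName = source.UserName,
                       target.Email    = source.Email
        WHEN NOT MATCHED BY TARGET THEN
            INSERT (UserId, UserName, Email)
            VALUES (source.UserId, source.UserName, source.Email);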

  6. Nightly Backup and Restores: If your data can tolerate high degrees of latency (up to a day) and periods of unavailability, then you could back up and restore the database across environments. This is not a good solution for web applications that need 100% uptime. The idea is that you take a baseline backup, restore it to the downstream server under a separate name, and then swap the names of the original database and the new copy as soon as the new one is ready for use (sketched below). I've seen this done for some internal website applications, but I do not generally recommend this approach; it's better suited to a lower development environment than a production environment.
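
    The mechanics, roughly, with database, logical file, and path names as placeholders:

        -- On the source server
        BACKUP DATABASE CoreDB TO DISK = 'E:\Backups\CoreDB.bak' WITH INIT;

        -- On the downstream server: restore under a staging name
        RESTORE DATABASE CoreDB_Staging
            FROM DISK = 'E:\Backups\CoreDB.bak'
            WITH MOVE 'CoreDB_Data' TO 'D:\Data\CoreDB_Staging.mdf',
                 MOVE 'CoreDB_Log'  TO 'D:\Data\CoreDB_Staging.ldf',
                 REPLACE;

        -- Swap names once the new copy is ready (brief outage while
        -- connections to the old copy are dropped)
        ALTER DATABASE CoreDB SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
        ALTER DATABASE CoreDB MODIFY NAME = CoreDB_Old;
        ALTER DATABASE CoreDB_Staging MODIFY NAME = CoreDB;
        -- The retired CoreDB_Old copy can be dropped once the swap is verified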

  7. Log Shipping Secondaries: You could set up log shipping between the primary and any number of secondaries. This is similar to the nightly backup and restore process, except that you can update the database more frequently. In one instance, this solution was used to expose data from one of our major core systems to downstream users by switching between two log shipping recipients: another server pointed to the two databases and switched between them whenever the newer one was available. I really hate this solution, but the one time I saw it implemented it did meet the needs of the business. One way to build that switching layer is sketched below.
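
    This uses synonyms that the applications query and that a refresh job re-points whenever a new copy comes online; ReportCopy_A and ReportCopy_B are placeholder database names, and this is just one way to do it, not necessarily how it was done in the case above:

        -- Re-point the synonym at whichever refreshed copy is current
        IF OBJECT_ID('dbo.Users', 'SN') IS NOT NULL
            DROP SYNONYM dbo.Users;
        CREATE SYNONYM dbo.Users FOR ReportCopy_A.dbo.Users;

        -- ...and on the next refresh cycle:
        -- DROP SYNONYM dbo.Users;
        -- CREATE SYNONYM dbo.Users FOR ReportCopy_B.dbo.Users;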