At the startup I'm working at we are now considering scaling solutions for our database. Things get somewhat confusing (for me at least) with MySQL, which has the MySQL cluster, replication and MySQL cluster replication (from ver. 5.1.6), which is an asynchronous version of the MySQL cluster. The MySQL manual explains some of the differences in its cluster FAQ, but it is hard to ascertain from it when to use one or the other.
I would appreciate any advice from people who are familiar with the differences between those solutions and what are the pros and cons, and when do you recommend to use each.
I've been doing A LOT of reading on the available options. I also got my hands on High Performance MySQL 2nd edition, which I highly recommend.
This is what I've managed to piece together:
Clustering in the general sense is distributing load across many servers that appear to an outside application as one server.
MySQL NDB Cluster is a distributed, in-memory, shared-nothing storage engine with synchronous replication and automatic data partioning (excuse me I borrow literally from the High Performance book, but they put it very nicely there). It can be a high performance solution for some applications, but web application generally do not work well on it.
The major problem is that beyond very simple queries (that touch only one table), the cluster will generally have to search for data on several nodes, allowing network latency to creep in and significantly slow down completion time for queries. Since the application treats the cluster as one computer, it can't tell it which node to fetch the data from.
In addition, the in-memory requirement is not workable for many large databases.
This is another clustering solution for MySQL, that acts as a middleware on top of the MySQL server. It offers synchronous replication, load balancing and failover. It also ensures that requests always get the data from the latest copy, automatically choosing a node that has the fresh data.
I've read some good things on it, and overall it sounds pretty promising.
Federation is similar to clustering, so I tugged it here as well. MySQL offers federation via the federated storage engine. Similar to the NDB cluster solution, it works well with simple queries only - but even worse the the cluster for complicated ones (since network latency is much higher).
MySQL has the built in capacity to create replications of a database on different servers. This can be used for many things - splitting the load between servers, hot backups, creating test servers and failover.
The basic setup of replication involves one master server handling mostly writes and one or more slaves handling reads only. A more advanced variation is that of the master-master configuration, which allows to scale writes as well by having several servers writing at the same time.
Each configuration has its pros and cons, but one problem they all share is replication lag - since MySQL replication is asynchronous, not all nodes have the freshest data at all time. This requires the application to be aware of the replication and incorporate replication-aware queries to work as expected. For some applications this might not be a problem, but if you always need the freshest data things get somewhat complicated.
Replication requires some load balancing to split the load between the nodes. This can be as simple as some modifications to the application code, or using dedicated software and hardware solutions.
Sharding is commonly used approach to scale database solutions. You split the data into smaller shards and spread them around different server nodes. This requires the application to be aware of the modification to the data storage to work efficiently, as it needs to know where to find the information it needs.
There are abstraction frameworks available to help deal with data sharding, such as Hibernate Shards, an extension to the Hibernate ORM (which unfortunately is in Java. I'm using PHP). HiveDB is another such solution which also supports shard rebalancing.
Sphinx is a full-text search engine, that can be used for far more than test searches. For many queries it is much faster than MySQL (especially for grouping and sorting), and can query remote systems in parallel and aggregate the results - which make it very useful in use with sharding.
In general sphinx should be used with other scaling solutions to get more of the available hardware and infrastructure. The downside is that again you need the application code to be aware of sphinx to use it wisely.
Scaling solutions differ depending on the needs of the application that needs it. For us and for most web-applications, I believe that replication (probably multi-master) is the way to go with a load balancer distributing the load. Sharding of specific problem areas (huge tables) is also a must for being able to scale horizontally.
I'm also going to give a shot to Continuent Sequoia and see if it can really do what it promises to since it will involve the least amount of changes to application code.