When to use horizontal partitioning and when to use database sharding?

database-design database-performance sharding database-partitioning

Dio Phung · Jun 1, 2014 · Viewed 9.5k times · Source

I'm reading this article on Wikipedia: http://en.wikipedia.org/wiki/Shard_(database_architecture) trying to find the major difference between these 2 techniques. Here is what I found:

Horizontal partitioning splits one or more tables by row, usually within a single instance of a schema and a database server. It may offer an advantage by reducing index size (and thus search effort) provided that there is some obvious, robust, implicit way to identify in which table a particular row will be found, without first needing to search the index, e.g., the classic example of the 'CustomersEast' and 'CustomersWest' tables, where their zip code already indicates where they will be found.

Sharding goes beyond this: it partitions the problematic table(s) in the same way, but it does this across potentially multiple instances of the schema. The obvious advantage would be that search load for the large partitioned table can now be split across multiple servers (logical or physical), not just multiple indexes on the same logical server.

As I understood, horizontal partitioning is more applicable for single instance (single node environment) whereas sharding is used in multi-node / multiple data-center environment. Is this correct? Or are there different usage scenarios ?

Extra question: For a huge table (with millions of rows) with simple schema (about 4-5 columns) , what is the best technique to improve read/write performance on this table ?

Answer

You are correct, horizontal partition (supported for example in MySQL and PostgreSQL) splits a table up within a single server. This can improve performance because data and indexes can be split across many disk volumes, improving I/O. This is usually done with a key range.

With database sharding, you are dividing the data across multiple servers, not just within a single server. In this case you use a shard key to partition the data, typically with some sort of hashing algorithm. You can get a white paper on this subject here (offered by our company, it is not specific to any product, it explains the technology): http://www.codefutures.com/database-sharding-white-paper/

The advantage of DBMS single server partitioning is that it is relatively simple to set up and manage. The disadvantage is ultimately you are limited by what a single server can do. This is particularly the case when it comes to heavy write contention, database locking and heavy queries.

Database Sharding takes more work, but has the advantage of being a shared-nothing approach, thus it is fully scalable.

A clear indicator that database sharding is needed is when a single server cannot keep up with write volume. If you have many heavy queries, this also can drive the need for this type of solution.

Having said all of this, if you are talking about "millions" of rows with 4 - 5 columns, and your reads can be well-indexed for fast access, it is doubtful if you need to implement either of these options. When you are talking 100s of millions or billions of rows, with 1000s of users, that is where database scalability is critical.

There is also an informational Web site I am working on regarding database scalability: www.bigdatascalability.com. It includes links to various articles, and will have new content added over time.

When to use horizontal partitioning and when to use database sharding?

Answer

Related questions