Making sharding simple with Django

MiniQuark picture MiniQuark · Aug 2, 2012 · Viewed 9.5k times · Source

I have a Django project based on multiple PostgreSQL servers.

I want users to be sharded across those database servers using the same sharding logic used by Instagram:

User ID => logical shard ID => physical shard ID => database server => schema => user table

  • The logical shard ID is directly calculated from the user ID (13 bits embedded in the user id).
  • The mapping from logical to physical shard ID is hard coded (in some configuration file or static table).
  • The mapping from physical shard ID to database server is also hard coded. Instagram uses Pgbouncer at this point to retrieve a pooled database connection to the appropriate database server.
  • Each logical shard lives in its own PostgreSQL schema (for those not familiar with PostgreSQL, this is not a table schema, it's rather like a namespace, similar to MySQL 'databases'). The schema is simply named something like "shardNNNN", where NNNN is the logical shard ID.
  • Finally, the user table in the appropriate schema is queried.

How can this be achieved as simply as possible in Django ?

Ideally, I would love to be able to write Django code such as:

Fetching an instance

# this gets the user object on the appropriate server, in the appropriate schema:
user = User.objects.get(pk = user_id)

Fetching related objects

# this gets the user's posted articles, located in the same logical shard:
articles = user.articles

Creating an instance

# this selects a random logical shard and creates the user there:
user = User.create(name = "Arthur", title = "King")
# or:
user = User(name = "Arthur", title = "King")
user.save()

Searching users by name

# fetches all relevant users (kings) from all relevant logical shards
# - either by querying *all* database servers (not good)
# - or by querying a "name_to_user" table then querying just the
#   relevant database servers.
users = User.objects.filter(title = "King")

To make things even more complex, I use Streaming Replication to replicate every database server's data to multiple slave servers. The masters should be used for writes, and the slaves should be used for reads.

Django provides support for automatic database routing which is probably sufficient for most of the above, but I'm stuck with User.objects.get(pk = user_id) because the router does not have access to the query parameters, so it does not know what the user ID is, it just knows that the code is trying to read the User model.

I am well aware that sharding should probably be used only as a last resort optimization since it has limitations and really makes things quite complex. Most people don't need sharding: an optimized master/slave architecture can go a very long way. But let's assume I do need sharding.

In short: how can I shard data in Django, as simply as possible?

Thanks a lot for your kind help.

Note

There is an existing question which is quite similar, but IMHO it's too general and lacks precise examples. I wanted to narrow things down to a particular sharding technique I'm interested in (the Instagram way).

Answer

Streeter picture Streeter · Apr 9, 2013

Mike Clarke recently gave a talk at PyPgDay on how Disqus shards their users with Django and PostgreSQL. He wrote up a blog post on how they do it.

Several strategies can be employed when sharding Postgres databases. At Disqus, we chose to shard based on table name. Where as the original table name as generated by Django might be comments_post, our sharding tools will rewrite the SQL to query a table comments_post_X, where X is the shard ID calculated based on a consistent hashing scheme. All these tables live in a single schema, on a single database instance.

In addition, they released some code as part of a sample application demonstrating how they shard.