Serialization with Protocol Buffers in a Schemaless Database

Évariste Galois · Nov 25, 2010 · Viewed 9k times

We're using MySQL to store schemaless data (see Using a Relational Database for Schemaless Data for the design, which is inspired by how FriendFeed uses MySQL).

One big table holds all entities for our application:

CREATE TABLE entities (
  added_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
, id BINARY(16) NOT NULL
, body MEDIUMBLOB
, UNIQUE KEY (id)
) ENGINE=InnoDB ;
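
For concreteness, here's a sketch of the application side, assuming the mysql-connector-python driver and placeholder connection settings (none of this is prescribed by the table design; any DB-API driver would look similar):

import uuid
import mysql.connector  # assumed driver for this sketch

conn = mysql.connector.connect(
    host="localhost", user="app", password="app", database="app"
)  # placeholder connection settings

def put_entity(body_bytes):
    # body_bytes is the serialized entity -- opaque to the database
    entity_id = uuid.uuid4()
    cur = conn.cursor()
    cur.execute(
        "INSERT INTO entities (id, body) VALUES (%s, %s)",
        (entity_id.bytes, body_bytes),  # uuid.bytes fits BINARY(16)
    )
    conn.commit()
    return entity_id

def get_entity(entity_id):
    cur = conn.cursor()
    cur.execute("SELECT body FROM entities WHERE id = %s", (entity_id.bytes,))
    row = cur.fetchone()
    return row[0] if row else None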

A few details:

  • The only required property of stored entities is id, a 16-byte UUID. The rest of the entity is opaque to the database. We can change the "schema" simply by storing new properties in the body.

  • The added_id column is present because InnoDB stores data rows physically in primary-key order. The AUTO_INCREMENT primary key ensures new entities are written sequentially on disk after old ones, which helps read/write locality (new entities are read more frequently than old ones).

  • Our database stores our schemaless data in the body. <- This is the topic of this question.

  • Plenty of other interesting details, like "reaching into" the body data to build asynchronous materialized views (indexes are just tables that are built offline; a rough sketch follows this list), but they're not relevant to the current discussion...
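
For what it's worth, a minimal sketch of that offline index build, assuming a JSON-serialized body and a made-up index_user_id table (both are illustrative only):

import json

def backfill_user_index(conn):
    # Offline pass: scan entities, "reach into" each body, and rebuild a
    # hypothetical index table:
    #   CREATE TABLE index_user_id (
    #     user_id BINARY(16) NOT NULL,
    #     entity_id BINARY(16) NOT NULL,
    #     PRIMARY KEY (user_id, entity_id)
    #   ) ENGINE=InnoDB;
    cur = conn.cursor()
    cur.execute("SELECT id, body FROM entities")
    rows = cur.fetchall()  # fine offline; batch or stream for large tables
    for entity_id, body in rows:
        doc = json.loads(body)        # assumes a JSON body for this example
        user_id = doc.get("user_id")  # hex string in this made-up shape
        if user_id is not None:
            cur.execute(
                "INSERT IGNORE INTO index_user_id (user_id, entity_id) "
                "VALUES (%s, %s)",
                (bytes.fromhex(user_id), entity_id),
            )
    conn.commit()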

How should we be serializing the structured data (key-value pairs) in the body?

JSON or BSON would be simple, but the field names are repeated in every row. That gives them an advantage in flexibility but a big disadvantage in space efficiency (a per-row overhead for field names in the serialized data). We're trying to keep things in memory, and minimizing both memory and network footprint is important here: the more records we can fit in the same space, the faster our queries will be. We prefer relatively long, descriptive field names, and shortening them just to make our database faster is wrong!
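
To put a number on that overhead, a quick illustrative measurement (the record shape is invented for the example):

import json

entity = {  # invented record with the descriptive names we prefer
    "user_id": "8f1e4b2a9c0d4e6f8a1b2c3d4e5f6a7b",
    "feed_title": "Example feed",
    "published_at": "2010-11-25T12:00:00Z",
    "is_hidden": False,
}
short = {  # the same record with one-letter keys
    "u": entity["user_id"], "t": entity["feed_title"],
    "p": entity["published_at"], "h": entity["is_hidden"],
}

long_size = len(json.dumps(entity).encode("utf-8"))
short_size = len(json.dumps(short).encode("utf-8"))
print(f"descriptive keys: {long_size} bytes")
print(f"one-letter keys:  {short_size} bytes")
print(f"difference: {long_size - short_size} bytes, paid again on every row")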

In the end, JSON/BSON is unworkable for our purposes, unless we add complexity and map short keys to more descriptive keys in the application driver that talks to the database. Which got us thinking...
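
Sketched out, that driver-side mapping might look like this (the mapping table itself is hypothetical):

import json

KEY_MAP = {  # hypothetical per-entity-kind mapping, owned by the driver
    "user_id": "u", "feed_title": "t",
    "published_at": "p", "is_hidden": "h",
}
REVERSE_MAP = {v: k for k, v in KEY_MAP.items()}

def encode_body(doc):
    # shrink descriptive keys just before serializing for storage
    return json.dumps({KEY_MAP.get(k, k): v for k, v in doc.items()})

def decode_body(raw):
    # restore descriptive keys right after reading from storage
    return {REVERSE_MAP.get(k, k): v for k, v in json.loads(raw).items()}

Application code keeps its long names and only the stored bytes shrink, but the map itself now has to be versioned and shared by everything that touches the database.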

Although our database is schemaless, in reality: 1) there aren't too many different kinds of entities, 2) versions of the same kind of entity don't change often, and 3) when they do change, it's usually just to add another field. JSON/BSON have no native support for versioning.

Protocol Buffers and Thrift are much more sophisticated when it comes to versioning and data-definition changes. Both are great candidates for serializing data into databases, and Thrift in particular is designed so that the encoding format itself is extensible.
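
The reason Protocol Buffers handles the add-a-field case well is visible in its wire format: every field carries a numeric tag, and decoders skip tags they don't recognize. Here's a hand-rolled illustration of that mechanism (deliberately not the real protobuf library, and handling only string fields):

def encode_varint(n):
    # protobuf base-128 varint: 7 bits per byte, high bit = continuation
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

def encode_str_field(field_number, text):
    # length-delimited field: tag = (field_number << 3) | wire type 2
    payload = text.encode("utf-8")
    return (encode_varint((field_number << 3) | 2)
            + encode_varint(len(payload)) + payload)

def decode_varint(buf, pos):
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def decode_known_fields(buf, known):
    # decode the string fields we know about; silently skip the rest
    fields, pos = {}, 0
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x7
        assert wire_type == 2  # this sketch only handles string fields
        length, pos = decode_varint(buf, pos)
        if field_number in known:
            fields[known[field_number]] = buf[pos:pos + length].decode("utf-8")
        pos += length  # unknown field number: skipped, not an error
    return fields

record = (encode_str_field(1, "user-123")       # fields a "v1" reader knows
          + encode_str_field(2, "Example feed")
          + encode_str_field(3, "added later"))  # new "v2" field
print(decode_known_fields(record, {1: "user_id", 2: "feed_title"}))

A "v1" reader that knows only fields 1 and 2 still decodes a "v2" record that added field 3; the new field is skipped rather than breaking the parse. That's exactly the kind of evolution we expect to need.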

Protocol Buffers look like a great choice for serializing data in a schemaless database.

CouchDB and MongoDB (the two most popular schemaless databases?) use JSON and BSON respectively, but we can't find anything about using something more advanced, like Protocol Buffers, as a serialization format for storing schemaless data. There are products that store a specific language's objects (e.g. storing Java's Externalizable objects in a data grid, or doing NoSQL with MySQL in Ruby), but these are a pain (try accessing them from other platforms, or even from MySQL itself, and forget about versioning).

Is anyone storing the more interoperable Protocol Buffers in their database, or some other advanced serialization format, in their schemaless database? This is a question of whether there are options besides straightforward per-row serialization of JSON/BSON/XML or serialization of a specific language's objects. Is it even feasible? Are we missing something? Sorry for the stream-of-consciousness narrative!

Answer

Jeremiah Peschka · Nov 30, 2010

As you found out, MongoDB and CouchDB have strong opinions about how you store your data. If you're looking for a storage-agnostic approach, you'll want to do something like @Joshua suggests and look at Cassandra or HBase. Even those two datastores have opinions about how data should be stored: both borrow their data model from Google's Bigtable and organize data into column families.

Riak supports Protocol Buffers as one of its interfaces for moving serialized data between your application and the datastore. It might be worth checking out to see if it fits your needs. Since it looks like you're largely planning to do single-key lookups, Riak could be a strong contender for your solution.
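
If it helps, here's a sketch of what storing an opaque serialized body in Riak could look like over its HTTP interface. I'm assuming the default port and the /riak/<bucket>/<key> path here, so double-check against the docs:

import requests  # assumed HTTP client; Riak also ships native clients

RIAK_URL = "http://localhost:8098/riak"  # assumed default HTTP endpoint

def put_entity(bucket, key, body_bytes):
    # body_bytes is your serialized entity, e.g. an encoded protobuf message
    resp = requests.put(
        f"{RIAK_URL}/{bucket}/{key}",
        data=body_bytes,
        headers={"Content-Type": "application/octet-stream"},
    )
    resp.raise_for_status()

def get_entity(bucket, key):
    resp = requests.get(f"{RIAK_URL}/{bucket}/{key}")
    resp.raise_for_status()
    return resp.content  # opaque bytes; decode with your serializer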