What's the best serialization method for objects in memcached?

mb. · Jan 31, 2009

My Python application currently uses the python-memcached API to set and get objects in memcached. This API uses Python's native pickle module to serialize and de-serialize Python objects.

This API makes it simple and fast to store nested Python lists, dictionaries and tuples in memcached, and reading these objects back into the application is completely transparent -- it just works.

But I don't want to be limited to using Python exclusively, and if all the memcached objects are serialized with pickle, then clients written in other languages won't work.

Here are the cross-platform serialization options I've considered:

  1. XML - the main benefit is that it's human-readable, but that's not important in this application. XML also takes a lot of space, and it's expensive to parse.

  2. JSON - seems like a good cross-platform standard, but I'm not sure it preserves object types when read back from memcached. For example, according to this post, tuples are transformed into lists when using simplejson. It also seems like adding elements to the JSON structure could break code written for the old structure.

  3. Google Protocol Buffers - I'm really interested in this because it seems very fast and compact -- at least 10 times smaller and faster than XML. It's not human-readable, but that's not important for this app, and it seems designed to support growing the structure without breaking old code.
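
To make the JSON concern in option 2 concrete, here is a minimal sketch using Python's standard `json` module (not simplejson, though the behavior on this point is the same). The record and its field names are made up for illustration. It shows that tuples come back as lists, and that an older reader which only looks up the keys it knows is not necessarily broken by a newly added field.

```python
import json

# Hypothetical record; the field names are illustrative only.
record = {"id": 42, "tags": ("a", "b"), "nested": {"point": (1, 2)}}

encoded = json.dumps(record)   # a plain string, usable as a memcached value
decoded = json.loads(encoded)

# JSON has a single array type, so tuples are read back as lists.
tags_type = type(decoded["tags"]).__name__
point = decoded["nested"]["point"]

# A newer writer adds a "version" field to the same structure.
newer = json.loads(json.dumps({**record, "version": 2}))

# An older reader that only accesses known keys still works;
# it simply never looks at the extra field.
old_reader_view = (newer["id"], newer.get("tags"))
```

Whether added fields break old code depends on how the reader is written: code that iterates over every key or compares whole structures would notice the change, while key-by-key access would not.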

Considering the priorities for this app, what's the ideal object serialization method for memcached?

  1. Cross-platform support (Python, Java, C#, C++, Ruby, Perl)

  2. Handling nested data structures

  3. Fast serialization/de-serialization

  4. Minimum memory footprint

  5. Flexibility to change structure without breaking old code
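
Priorities 3 and 4 can be checked empirically before committing to a format. The following is a rough sketch, not a rigorous benchmark: the sample object, the `measure` helper, and the round count are all assumptions chosen for illustration, and only formats from the standard library are compared. Other candidates could be added to the list with their own `dumps`/`loads` pair.

```python
import json
import pickle
import timeit

# Illustrative sample data; real cached objects will differ.
sample = {"user": "example", "scores": [1, 2, 3] * 100,
          "meta": {"active": True, "ratio": 0.5}}

def measure(name, dumps, loads, rounds=200):
    """Return serialized size and round-trip time for one format."""
    blob = dumps(sample)
    seconds = timeit.timeit(lambda: loads(dumps(sample)), number=rounds)
    return {"format": name, "bytes": len(blob), "seconds": seconds}

results = [
    measure("pickle",
            lambda o: pickle.dumps(o, protocol=pickle.HIGHEST_PROTOCOL),
            pickle.loads),
    measure("json",
            lambda o: json.dumps(o, separators=(",", ":")).encode("utf-8"),
            json.loads),
]

for row in results:
    print(row)
```

The numbers depend heavily on the shape of the data and the library versions, so it is worth running this kind of comparison on objects that resemble what the application actually caches.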

Answer

gahooa · Feb 19, 2009

One major consideration is whether you want to have to specify each structure definition up front.

If you are OK with that, then you could take a look at:

  1. Protocol Buffers - http://code.google.com/apis/protocolbuffers/docs/overview.html
  2. Thrift - http://developers.facebook.com/thrift/ (more geared toward services)

Both of these solutions require supporting files to define each data structure.


If you would prefer not to incur the developer overhead of pre-defining each structure, then take a look at:

  1. JSON (via Python cjson, and native PHP json). Both are really fast, as long as you don't need to transmit binary content (such as images).
  2. YAML (YAML Ain't Markup Language) - http://www.yaml.org/. Also really fast if you get the right library.

However, I believe that both of these have had issues with transporting binary content, which is why they were ruled out for our usage. Note: YAML may have good binary support, but you will have to check the client libraries. See http://yaml.org/type/binary.html
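
As an illustration of the binary limitation, JSON has no byte-string type, so raw bytes have to be wrapped somehow. A common workaround is base64. The sketch below uses Python's standard `json` and `base64` modules; the `__b64__` marker key and both function names are invented for this example and are not part of any standard or of any library mentioned here.

```python
import base64
import json

MARKER = "__b64__"  # hypothetical marker key, chosen for this sketch

def encode_with_binary(obj):
    """Serialize to JSON, wrapping bytes values as base64 text."""
    def default(value):
        # Called by json.dumps only for values it cannot serialize itself.
        if isinstance(value, (bytes, bytearray)):
            return {MARKER: base64.b64encode(bytes(value)).decode("ascii")}
        raise TypeError(f"cannot serialize {type(value).__name__}")
    return json.dumps(obj, default=default)

def decode_with_binary(text):
    """Parse JSON, turning marked base64 objects back into bytes."""
    def hook(d):
        if set(d) == {MARKER}:
            return base64.b64decode(d[MARKER])
        return d
    return json.loads(text, object_hook=hook)

payload = {"name": "thumb.png", "data": b"\x89PNG\r\n\x00\xff"}
restored = decode_with_binary(encode_with_binary(payload))
```

The cost is size and readability: base64 output is about one third larger than the raw bytes, and every client language needs to agree on the marker convention.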


At our company, we rolled our own library (Extruct) for cross-language serialization with binary support. We currently have (decently) fast implementations in Python and PHP, although the output isn't very human-readable because all strings are base64-encoded (for binary support). Eventually we will port them to C and use a more standard encoding.

Dynamic languages like PHP and Python get really slow if you have too many iterations in a loop or have to look at each character. C, on the other hand, shines at such operations.

If you'd like to see the implementation of Extruct, please let me know. (contact info at http://blog.gahooa.com/ under "About Me")