Computing an md5 hash of a data structure

Ned Batchelder picture Ned Batchelder · Mar 24, 2011 · Viewed 31.4k times · Source

I want to compute an md5 hash not of a string, but of an entire data structure. I understand the mechanics of a way to do this (dispatch on the type of the value, canonicalize dictionary key order and other randomness, recurse into sub-values, etc). But it seems like the kind of operation that would be generally useful, so I'm surprised I need to roll this myself.

Is there some simpler way in Python to achieve this?

UPDATE: pickle has been suggested, and it's a good idea, but pickling doesn't canonicalize dictionary key order:

>>> import cPickle as pickle
>>> import hashlib, random 
>>> for i in range(10):
...  k = [i*i for i in range(1000)]
...  random.shuffle(k)
...  d = dict.fromkeys(k, 1)
...  p = pickle.dumps(d)
...  print hashlib.md5(p).hexdigest()
...
51b5855799f6d574c722ef9e50c2622b
43d6b52b885f4ecb4b4be7ecdcfbb04e
e7be0e6d923fe1b30c6fbd5dcd3c20b9
aebb2298be19908e523e86a3f3712207
7db3fe10dcdb70652f845b02b6557061
43945441efe82483ba65fda471d79254
8e4196468769333d170b6bb179b4aee0
951446fa44dba9a1a26e7df9083dcadf
06b09465917d3881707a4909f67451ae
386e3f08a3c1156edd1bd0f3862df481

Answer

webwurst picture webwurst · Apr 23, 2012

json.dumps() can sort dictionaries by key. So you don't need other dependencies:

import hashlib
import json

data = ['only', 'lists', [1,2,3], 'dictionaries', {'a':0,'b':1}, 'numbers', 47, 'strings']
data_md5 = hashlib.md5(json.dumps(data, sort_keys=True)).hexdigest()

print(data_md5)

Prints:

87e83d90fc0d03f2c05631e2cd68ea02