Shelve is too slow for large dictionaries, what can I do to improve performance?

Pietro Speroni picture Pietro Speroni · Aug 19, 2010 · Viewed 15.3k times · Source

I am storing a table using python and I need persistence.

Essentially I am storing the table as a dictionary string to numbers. And the whole is stored with shelve

self.DB=shelve.open("%s%sMoleculeLibrary.shelve"%(directory,os.sep),writeback=True) 

I use writeback to True as I found the system tends to be unstable if I don't.

After the computations the system needs to close the database, and store it back. Now the database (the table) is about 540MB, and it is taking ages. The time exploded after the table grew to about 500MB. But I need a much bigger table. In fact I need two of them.

I am probably using the wrong form of persistence. What can I do to improve performance?

Answer

nearlymonolith picture nearlymonolith · Aug 19, 2010

For storing a large dictionary of string : number key-value pairs, I'd suggest a JSON-native storage solution such as MongoDB. It has a wonderful API for Python, Pymongo. MongoDB itself is lightweight and incredibly fast, and json objects will natively be dictionaries in Python. This means that you can use your string key as the object ID, allowing for compressed storage and quick lookup.

As an example of how easy the code would be, see the following:

d = {'string1' : 1, 'string2' : 2, 'string3' : 3}
from pymongo import Connection
conn = Connection()
db = conn['example-database']
collection = db['example-collection']
for string, num in d.items():
    collection.save({'_id' : string, 'value' : num})
# testing
newD = {}
for obj in collection.find():
    newD[obj['_id']] = obj['value']
print newD
# output is: {u'string2': 2, u'string3': 3, u'string1': 1}

You'd just have to convert back from unicode, which is trivial.