Python: in-memory object database which supports indexing?

David Wolever picture David Wolever · Mar 1, 2011 · Viewed 17.1k times · Source

I'm doing some data munging which would be quite a bit simpler if I could stick a bunch of dictionaries in an in-memory database, then run simply queries against it.

For example, something like:

people = db([
    {"name": "Joe", "age": 16},
    {"name": "Jane", "favourite_color": "red"},
])
over_16 = db.filter(age__gt=16)
with_favorite_colors = db.filter(favorite_color__exists=True)

There are three confounding factors, though:

  • Some of the values will be Python objects, and serializing them is out of the question (too slow, breaks identity). Of course, I could work around this (eg, by storing all the items in a big list, then serializing their indexes in that list… But that could take a fair bit of fiddling).
  • There will be thousands of data, and I will be running lookup-heavy operations (like graph traversals) against them, so it must be possible to perform efficient (ie, indexed) queries.
  • As in the example, the data is unstructured, so systems which require me to predefine a schema would be tricky.

So, does such a thing exist? Or will I need to kludge something together?

Answer

gotgenes picture gotgenes · Mar 1, 2011

What about using an in-memory SQLite database via the sqlite3 standard library module, using the special value :memory: for the connection? If you don't want to write your on SQL statements, you can always use an ORM, like SQLAlchemy, to access an in-memory SQLite database.

EDIT: I noticed you stated that the values may be Python objects, and also that you require avoiding serialization. Requiring arbitrary Python objects be stored in a database also necessitates serialization.

Can I propose a practical solution if you must keep those two requirements? Why not just use Python dictionaries as indices into your collection of Python dictionaries? It sounds like you will have idiosyncratic needs for building each of your indices; figure out what values you're going to query on, then write a function to generate and index for each. The possible values for one key in your list of dicts will be the keys for an index; the values of the index will be a list of dictionaries. Query the index by giving the value you're looking for as the key.

import collections
import itertools

def make_indices(dicts):
    color_index = collections.defaultdict(list)
    age_index = collections.defaultdict(list)
    for d in dicts:
        if 'favorite_color' in d:
            color_index[d['favorite_color']].append(d)
        if 'age' in d:
            age_index[d['age']].append(d)
    return color_index, age_index


def make_data_dicts():
    ...


data_dicts = make_data_dicts()
color_index, age_index = make_indices(data_dicts)
# Query for those with a favorite color is simply values
with_color_dicts = list(
        itertools.chain.from_iterable(color_index.values()))
# Query for people over 16
over_16 = list(
        itertools.chain.from_iterable(
            v for k, v in age_index.items() if age > 16)
)