There are many solutions to serialize a small dictionary: json.loads/json.dumps, pickle, shelve, ujson, or even sqlite.
But when dealing with possibly 100 GB of data, such modules are no longer an option: they would potentially rewrite the entire dataset when closing / serializing.
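To illustrate the problem, here is a minimal sketch with json: changing a single key still forces the whole dictionary to be re-serialized to disk.

import json

d = {'hello': 17, '183': [12, 14, 24]}
d['new'] = 'value'                # change one key...
with open('myfile.json', 'w') as f:
    json.dump(d, f)               # ...and the entire dict is rewritten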
redis is not really an option because it uses a client/server scheme.
Question: Which serverless key:value stores, able to work with 100+ GB of data, are frequently used in Python?
I'm looking for a solution with a standard "Pythonic" d[key] = value syntax:
import mydb
d = mydb.mydb('myfile.db')
d['hello'] = 17 # able to use string or int or float as key
d[183] = [12, 14, 24] # able to store lists as values (will probably internally jsonify it?)
d.flush() # easy to flush on disk
Note: BsdDB (BerkeleyDB) seems to be deprecated. There seems to be a LevelDB binding for Python, but it doesn't seem well-known, and I haven't found a version that is ready to use on Windows. Which ones would be the most common?
Linked questions: Use SQLite as a key:value store, Flat file NoSQL solution
You can use sqlitedict, which provides a key-value interface to an SQLite database.
The SQLite limits page says that the theoretical maximum is 140 TB, depending on page_size and max_page_count. However, the default values for Python 3.5.2-2ubuntu0~16.04.4 (sqlite3 2.6.0) are page_size=1024 and max_page_count=1073741823. This gives ~1100 GB of maximal database size, which fits your requirement.
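If you want to check what your own build uses, here's a quick sketch using only the standard sqlite3 module (the exact defaults vary by build, so treat the numbers above as one data point; cache_size is included because it matters for the memory discussion below):

import sqlite3

conn = sqlite3.connect(':memory:')
for pragma in ('page_size', 'max_page_count', 'cache_size'):
    # Each PRAGMA query returns a single row holding the current value
    print(pragma, '=', conn.execute('PRAGMA ' + pragma).fetchone()[0])
conn.close()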
You can use the package like this:
from sqlitedict import SqliteDict
mydict = SqliteDict('./my_db.sqlite', autocommit=True)
mydict['some_key'] = any_picklable_object
print(mydict['some_key'])
for key, value in mydict.items():
    print(key, value)
print(len(mydict))
mydict.close()
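A side note on autocommit: autocommit=True commits after every write, which is convenient but slow for bulk loads. sqlitedict also lets you batch writes and commit explicitly; a minimal sketch, assuming the default autocommit=False and the documented commit() method:

from sqlitedict import SqliteDict

# Batch many writes into one transaction instead of committing per write
with SqliteDict('./my_db.sqlite') as mydict:  # autocommit is False by default
    for i in range(1000):
        mydict['key-%d' % i] = [i, i * 2]
    mydict.commit()  # persist all pending writes at once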
About memory usage: SQLite doesn't need your dataset to fit in RAM. By default it caches up to cache_size pages, which is barely 2 MiB (with the same Python as above). Here's a script you can use to check that with your own data. Before running it, install the dependencies:
pip install lipsum psutil matplotlib psrecord sqlitedict
sqlitedct.py
#!/usr/bin/env python3
import os
import random
from contextlib import closing
import lipsum
from sqlitedict import SqliteDict
def main():
    # Write 100,000 random-length text snippets under random 10-byte keys
    with closing(SqliteDict('./my_db.sqlite', autocommit=True)) as d:
        for _ in range(100000):
            v = lipsum.generate_paragraphs(2)[0:random.randint(200, 1000)]
            d[os.urandom(10)] = v

if __name__ == '__main__':
    main()
Run it like ./sqlitedct.py & psrecord --plot=plot.png --interval=0.1 $!. In my case it produces this chart:
And the database file:
$ du -h my_db.sqlite
84M my_db.sqlite
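Since the question mentions Windows, where du isn't available, the same size check can be done portably from Python (a trivial sketch):

import os

# Report the database file size in MiB
size = os.path.getsize('./my_db.sqlite')
print('%.1f MiB' % (size / 2**20))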