Comparing MongoDB and RethinkDB Bulk Insert Performance

Asked by njyunis · Mar 1, 2013 · Viewed 11.9k times

This is officially my first question here; I welcome any and all criticism of my post so that I can learn how to be a better SO citizen.

I am evaluating non-relational databases for storing potentially large email opt-out lists, leaning toward either MongoDB or RethinkDB and using their respective Python client libraries. The pain point of my application is bulk insert performance, so I have set up two Python scripts that each insert 20,000 records in batches of 5,000, one into a MongoDB collection and one into a RethinkDB table.

The MongoDB Python script, mongo_insert_test.py:

# Assumed connection setup (not shown in the original post); the
# database name is illustrative.
import pymongo

mongo = pymongo.MongoClient().test_db

NUM_LINES = 20000
BATCH_SIZE = 5000

def insert_records():
    collection = mongo.recips
    i = 0
    batch = []
    while i < NUM_LINES:
        i += 1
        recip = {
            'address': "test%d@test%d.com" % (i, i)
        }
        batch.append(recip)
        # Flush every BATCH_SIZE records, and once more at the end.
        if len(batch) == BATCH_SIZE or i == NUM_LINES:
            collection.insert(batch)  # one bulk insert per batch
            batch = []

if __name__ == '__main__':
    insert_records()

The nearly identical RethinkDB Python script, rethink_insert_test.py:

# Assumed connection setup (not shown in the original post): register a
# default (repl) connection so .run() works without an argument.
import rethinkdb as r

r.connect('localhost', 28015).repl()

NUM_LINES = 20000
BATCH_SIZE = 5000

def insert_records():
    i = 0
    batch = []
    while i < NUM_LINES:
        i += 1
        recip = {
            'address': "test%d@test%d.com" % (i, i)
        }
        batch.append(recip)
        # Flush every BATCH_SIZE records, and once more at the end.
        if len(batch) == BATCH_SIZE or i == NUM_LINES:
            r.table('recip').insert(batch).run()
            batch = []

if __name__ == '__main__':
    insert_records()

In my dev environment, the MongoDB script inserts 20,000 records in under a second:

$ time python mongo_insert_test.py 
real    0m0.618s
user    0m0.400s
sys     0m0.032s

In the same environment, the RethinkDB script performs far more slowly, taking over two minutes to insert the same 20,000 records:

$ time python rethink_insert_test.py
real    2m2.502s
user    0m3.000s
sys     0m0.052s

Am I missing something huge here with regard to how these two DBMSes work? Why is RethinkDB performing so badly in this test?

My dev machine had about 1.2GB of available memory for these tests.

Answer

Answered by coffeemug · Mar 1, 2013

RethinkDB currently implements batch inserts by doing a single insert at a time on the server. Since RethinkDB flushes every record to disk (it's designed with safety first in mind), this has a really bad effect on workloads like this one.
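In client-side terms, the batch above is currently treated roughly like the loop below. This is a sketch for illustration only; the connection details and table name simply mirror the test script:

import rethinkdb as r

r.connect('localhost', 28015).repl()

batch = [{'address': 'test%d@test%d.com' % (i, i)} for i in range(1, 5001)]

# Roughly what the server does with a batch today: each document is
# inserted, and flushed to disk, one at a time.
for recip in batch:
    r.table('recip').insert(recip).run()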

We're doing two things to address this:

  1. Bulk inserts will be implemented via a bulk insert algorithm on the server to avoid doing one insert at a time.
  2. We will give you the option to relax durability constraints so that cache memory can absorb high-throughput inserts if you'd like (in exchange for not syncing to disk as often); see the sketch after this list.
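
As a rough sketch of option 2, relaxed durability should look something like the snippet below once it ships. The durability optarg is the planned interface rather than current behavior, and the connection details and table name are illustrative:

import rethinkdb as r

r.connect('localhost', 28015).repl()

batch = [{'address': 'test%d@test%d.com' % (i, i)} for i in range(1, 5001)]

# Planned interface (sketch): durability='soft' lets the cache absorb the
# writes instead of syncing every record to disk before acknowledging.
r.table('recip').insert(batch, durability='soft').run()

The trade-off is the usual one: acknowledged writes can be lost if the server goes down before the cache is flushed.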

This will absolutely be fixed in the next 4-12 weeks (and if you need this ASAP, feel free to shoot me an email at [email protected] and I'll see if we can reprioritize).

Here are the relevant github issues:

https://github.com/rethinkdb/rethinkdb/issues/207

https://github.com/rethinkdb/rethinkdb/issues/314

Hope this helps. Please don't hesitate to ping us if you need help.