Handling optional/empty data in MongoDB

Oxymore picture Oxymore · May 15, 2013 · Viewed 14.8k times · Source

I remember reading somewhere that the mongo engine was more confortable when the entire structure of a document was already in place in case of an update, so here is the question.

When dealing with "empty" data, for example when inserting an empty string, should I default it to null, "" or not insert it at all ?

{
    _id: ObjectId("5192b6072fda974610000005"),
    description: ""
}

or

{
    _id: ObjectId("5192b6072fda974610000005"),
    description: null
}

or

{
    _id: ObjectId("5192b6072fda974610000005")
}

You have to remember that the description field may or may not be filled in every document (based on user input).

Answer

vinipsmaker picture vinipsmaker · May 15, 2013

Introduction

If a document doesn't have a value, the DB considers its value to be null. Suppose a database with the following documents:

{ "_id" : ObjectId("5192d23b1698aa96f0690d96"), "a" : 1, "desc" : "" }
{ "_id" : ObjectId("5192d23f1698aa96f0690d97"), "a" : 1, "desc" : null }
{ "_id" : ObjectId("5192d2441698aa96f0690d98"), "a" : 1 }

If you create a query to find documents with the field desc different than null, you will get just one document:

db.test.find({desc: {$ne: null}})
// Output:
{ "_id" : ObjectId("5192d23b1698aa96f0690d96"), "a" : 1, "desc" : "" }

The database doesn't differ documents without a desc field and documents with a desc field with the value null. One more test:

db.test.find({desc: null})
// Output:
{ "_id" : ObjectId("5192d2441698aa96f0690d98"), "a" : 1 }
{ "_id" : ObjectId("5192d23f1698aa96f0690d97"), "a" : 1, "desc" : null }

But the differences are only ignored in the queries, because, as shown in the last example above, the fields are still saved on disk and you'll receive documents with the same structure of the documents that were sent to the MongoDB.

Question

When dealing with "empty" data, for example when inserting an empty string, should I default it to null, "" or not insert it at all ?

There isn't much difference from {desc: null} to {}, because most of the operators will have the same result. You should only pay special attention to these two operators:

I'd save documents without the desc field, because the operators will continue to work as expected and I'd save some space.

Padding factor

If you know the documents in your database grow frequently, then MongoDB might need to move the documents during the update, because there isn't enough space in the previous document place. To prevent moving documents around, MongoDB allocates extra space for each document.

The ammount of extra space allocated by MongoDB per document is controlled by the padding factor. You cannot (and don't need to) choose the padding factor, because MongoDB will adaptively learn it, but you can help MongoDB preallocating internal space for each document by filling the possible future fields with null values. The difference is very small (depending on your application) and might be even smaller after MongoDB learn the best padding factor.

Sparse indexes

This section isn't too important to your specific problem right now, but may help you when you face similar problems.

If you create an unique index on field desc, then you wouldn't be able to save more than one document with the same value and in the previous database, we had more than one document with same value on field desc. Let's try to create an unique index in the previous presented database and see what error we get:

db.test.ensureIndex({desc: 1}, {unique: true})
// Output:
{
    "err" : "E11000 duplicate key error index: test.test.$desc_1  dup key: { : null }",
    "code" : 11000,
    "n" : 0,
    "connectionId" : 3,
    "ok" : 1
}

If we want to be able to create an unique index on some field and let some documents have this field empty, we should create a sparse index. Let's try to create the unique index again:

// No errors this time:
db.test.ensureIndex({desc: 1}, {unique: true, sparse: true})

So far, so good, but why am I explaining all this? Because there is a obscure behaviour about sparse indexes. In the following query, we expect to have ALL documents sorted by desc.

db.test.find().sort({desc: 1})
// Output:
{ "_id" : ObjectId("5192d23f1698aa96f0690d97"), "a" : 1, "desc" : null }
{ "_id" : ObjectId("5192d23b1698aa96f0690d96"), "a" : 1, "desc" : "" }

The result seems weird. What happened to the missing document? Let's try the query without sorting it:

{ "_id" : ObjectId("5192d23b1698aa96f0690d96"), "a" : 1, "desc" : "" }
{ "_id" : ObjectId("5192d23f1698aa96f0690d97"), "a" : 1, "desc" : null }
{ "_id" : ObjectId("5192d2441698aa96f0690d98"), "a" : 1 }

All documents were returned this time. What's happening? It's simple, but not so obvious. When we sort the result by desc, we use the sparse index created previously and there is no entries for the documents that haven't the desc field. The following query show us the use of the index to sort the result:

db.test.find().sort({desc: 1}).explain().cursor
// Output:
"BtreeCursor desc_1"

We can skip the index using a hint:

db.test.find().sort({desc: 1}).hint({$natural: 1})
// Output:
{ "_id" : ObjectId("5192d23f1698aa96f0690d97"), "a" : 1, "desc" : null }
{ "_id" : ObjectId("5192d2441698aa96f0690d98"), "a" : 1 }
{ "_id" : ObjectId("5192d23b1698aa96f0690d96"), "a" : 1, "desc" : "" }

Summary

  • Sparse unique indexes don't work if you include {desc: null}
  • Sparse unique indexes don't work if you include {desc: ""}
  • Sparse indexes might change the result of a query