MongoDB relationships: embed or reference?

Freewind picture Freewind · Mar 21, 2011 · Viewed 179.4k times · Source

I'm new to MongoDB--coming from a relational database background. I want to design a question structure with some comments, but I don't know which relationship to use for comments: embed or reference?

A question with some comments, like stackoverflow, would have a structure like this:

Question
    title = 'aaa'
    content = bbb'
    comments = ???

At first, I want to use embeded comments (I think embed is recommended in MongoDB), like this:

Question
    title = 'aaa'
    content = 'bbb'
    comments = [ { content = 'xxx', createdAt = 'yyy'}, 
                 { content = 'xxx', createdAt = 'yyy'}, 
                 { content = 'xxx', createdAt = 'yyy'} ]

It clear, but I'm worried about this case: If I want to edit a specified comment, how do I get its content and its question? There is no _id to let me find one, nor question_ref to let me find its question. (I'm so newbie, that I don't know if there's any way to do this without _id and question_ref.)

Do I have to use ref not embed? Then I have to create a new collection for comments?

Answer

John F. Miller picture John F. Miller · Mar 21, 2011

This is more an art than a science. The Mongo Documentation on Schemas is a good reference, but here are some things to consider:

  • Put as much in as possible

    The joy of a Document database is that it eliminates lots of Joins. Your first instinct should be to place as much in a single document as you can. Because MongoDB documents have structure, and because you can efficiently query within that structure (this means that you can take the part of the document that you need, so document size shouldn't worry you much) there is no immediate need to normalize data like you would in SQL. In particular any data that is not useful apart from its parent document should be part of the same document.

  • Separate data that can be referred to from multiple places into its own collection.

    This is not so much a "storage space" issue as it is a "data consistency" issue. If many records will refer to the same data it is more efficient and less error prone to update a single record and keep references to it in other places.

  • Document size considerations

    MongoDB imposes a 4MB (16MB with 1.8) size limit on a single document. In a world of GB of data this sounds small, but it is also 30 thousand tweets or 250 typical Stack Overflow answers or 20 flicker photos. On the other hand, this is far more information than one might want to present at one time on a typical web page. First consider what will make your queries easier. In many cases concern about document sizes will be premature optimization.

  • Complex data structures:

    MongoDB can store arbitrary deep nested data structures, but cannot search them efficiently. If your data forms a tree, forest or graph, you effectively need to store each node and its edges in a separate document. (Note that there are data stores specifically designed for this type of data that one should consider as well)

    It has also been pointed out than it is impossible to return a subset of elements in a document. If you need to pick-and-choose a few bits of each document, it will be easier to separate them out.

  • Data Consistency

    MongoDB makes a trade off between efficiency and consistency. The rule is changes to a single document are always atomic, while updates to multiple documents should never be assumed to be atomic. There is also no way to "lock" a record on the server (you can build this into the client's logic using for example a "lock" field). When you design your schema consider how you will keep your data consistent. Generally, the more that you keep in a document the better.

For what you are describing, I would embed the comments, and give each comment an id field with an ObjectID. The ObjectID has a time stamp embedded in it so you can use that instead of created at if you like.