Single or Multiple Entities Per Collection in DocumentDB

Abhijeet picture Abhijeet · Dec 13, 2014 · Viewed 9.4k times · Source

Should there be one entity per collection in document DB?

Consider I have foreign key relationship in below diagram: enter image description here

Should I create two collections one for employee & other for company. Or should I store them into a single collection?

I read here that in documentdb scope of stored procedures triggers etc are within a collection. So by splitting differetn entities into separate collection I loose out of box functionality.

So wouldn't it be better to dump both the classes as single entity as below:

{
  "Id": 1001,
  "Industry": "Software",
  "Employees": [
    {
      "Id": 10011,
      "Name": "John Doe",
      "CompanyId": 1001
    },
    {
      "Id": 10012,
      "Name": "Jane Doe",
      "CompanyId": 1001
    }
  ]
}

What is the standard practise of implementing related entities in DocumentDB?

Answer

Andrew Liu picture Andrew Liu · Dec 14, 2014

It is generally good to store multiple entity types per collection. Whether to store entity types within to a single document or not takes a bit more thought.

As David mentioned - how to model data is a bit subjective.

Storing Multiples Entity Types within a Collection

First... let's talk about storing multiple entities in a collection. DocumentDB collections are not tables. Collections do not enforce schema; in other words, you can store different types of documents with different schemas in the same collection. You can track different types of entities simply by adding a type attribute to your document.

You should think of Collections as a unit of partition and boundary for the execution of queries and transactions. Thus a huge perk for storing different entity types within the same collection is you get transaction support right out of the box via sprocs.

Storing Multiple Entity Types within a Document

Whether you store multiple entity types within a single document takes a bit more thought. This is commonly referred to de-normalizing (capturing relationships between data by embedding data in a single document) and normalizing (capturing relationships between data by creating weak links to o other documents) your data.

Typically de-normalizing provides better read performance.

The application may need to issue fewer queries and updates to complete common operations.

In general, use de-normalized data models when:

  • have “contains” relationships between entities
  • have one-to-few relationships between entities
  • de-normalized data changes infrequently
  • de-normalized data won’t grow without bound
  • de-normalized data is integral to data in document

Example of a de-normalized data model:

{
  "Id": 1001,
  "Type": "Company",
  "Industry": "Software",
  "Employees": [
    {
      "Id": 10011,
      "Type": "Employee",
      "Name": "John Doe"
    },
    {
      "Id": 10012,
      "Type": "Employee",
      "Name": "Jane Doe"
    }
  ]
}

Typically normalizing provides better write performance.

Provides more flexibility than de-normalizing

Client-side applications must issue follow-up queries to resolve the references. In other words, normalized data models can require more round trips to the server.

In general, use normalized data models:

  • when de-normalizing would result in duplication of data but would not provide sufficient read performance advantages to outweigh the implications of the duplication.
  • representing one-to-many relationships
  • represent many-to-many relationships.
  • related data changes frequently

Example of a normalized data model:

{
  "Id": 1001,
  "Type": "Company",
  "Industry": "Software"
}

{
  "Id": 10011,
  "Type": "Employee",
  "Name": "John Doe",
  "CompanyId": 1001
}

{
  "Id": 10012,
  "Type": "Employee",
  "Name": "Jane Doe",
  "CompanyId": 1001
}

Hybrid Approaches

Choosing between normalizing and de-normalizing doesn't have to be a black and white choice. I've often found that a winning design pattern is a hybrid approach, in which you may choose to normalize a partial set of an object's fields and de-normalize the others.

In other words, you could choose to de-normalize frequently read stable (or immutable) properties to reduce the need for follow up queries, while normalize frequently written / mutating fields to reduce the need for fanning out writes.

Example of a hybrid approach:

// Author documents:
[{
  "id": 1,
  "firstName": "Thomas",
  "lastName": "Andersen",
  "countOfBooks": 3,
  "books": [1, 2, 3],
  "images": [{
    "thumbnail": "http://....png"
  }, {
    "profile": "http://....png"
  }, {
    "large": "http://....png"
  }]
}, {
  "id": 2,
  "firstName": "William",
  "lastName": "Wakefield",
  "countOfBooks": 1,
  "books": [1, 4, 5],
  "images": [{
    "thumbnail": "http://....png"
  }]
}]

// Book documents:
[{
  "id": 1,
  "name": "DocumentDB 101",
  "authors": [{
    "id": 1,
    "name": "Thomas Andersen",
    "thumbnailUrl": "http://....png"
  }, {
    "id": 2,
    "name": "William Wakefield",
    "thumbnailUrl": "http://....png"
  }]
}, {
  "id": 2,
  "name": "DocumentDB for RDBMS Users",
  "authors": [{
    "id": 1,
    "name": "Thomas Andersen",
    "thumbnailUrl": "http://....png"
  }, ]
}]