Modeling relationships on CouchDB between documents?

dstaley picture dstaley · Jul 16, 2013 · Viewed 7.3k times · Source

I'm trying to model a fairly simple relationship in CouchDB and I'm having trouble determining the best way to accomplish this. I'd like users to be able to create lists of video game objects. I have the video game documents stored in the DB already with "type":"game". I'd like to be able to query a list object's ID (via a view) and get back the list's metadata (title, creation date, etc.) and portions of the game document (such as title and release date). Furthermore, I'd like to be able to add/removes games to/from lists without downloading the entire list document and posting it back (so this means I can't simply store the game's information in the list document) as I'd eventually like to support multiple users contributing to the same list, and I don't want to introduce conflicts.

After reading the CouchDB wiki on EntityRelationships, I've determined that setting up relationship documents might be the best solution.

Game:

{
    "_id": "2600emu",
    "type": "game"
}

List:

{
    "_id": 123,
    "title": "Emulators",
    "user_id": "dstaley",
    "type": "list"
}

Game-List Relationship:

{
    "_id": "98765456789876543",
    "type": "relationship",
    "list_id": 123,
    "game_id": "2600emu"
}

But, from what I understand, this wouldn't allow me to get the list's metadata and the game's metadata in one request. Any advice?

Answer

Mike Miller picture Mike Miller · Jul 16, 2013

Great question. You identify a few very important reasons that using a "normalized" data model (different document types with links) is an optimal model:

  1. You have a many-to-many relationship between users <==> lists <==> games.
  2. One-to-many relationships are easy to represent in a single document that uses a container for the 'many' portion, but they get large and you can have concurrency conflicts.
  3. Extending the single-doc model to store a many-to-many relationship is untenable.
  4. In general, document immutability is a great fit for concurrent systems. In CouchDB you do that exactly as you noted, by storing 'write-once' documents that represent an edge in your graph, then using secondary indexes to rebuild the portions of the links that you want and to get the information you want in a single API query call.

You're also right that the solution here is a 'map-side-join' (to borrow from the hadoop community). Basically you want to use different rows in the map output to represent different pieces of information. You can then use a range query (startkey/endkey) to query just the portion of the map result that you need, and, voila, your materialized view of the 'join' table. However, the one piece of the puzzle you didn't find in the documentation is this:

3.2.3. Joins With Views

3.2.3.1. Linked Documents

If your map function emits an object value which has {'_id': XXX} and you query view with include_docs=true parameter, then CouchDB will fetch the document with id XXX rather than the document which was processed to emit the key/value pair.

Says it all. That's how you dereference the pointer to the linked document that you have stored by a foreign key. You then combine that with the use of compound keys (keys that are JS arrays) and the view collation rules.

So that your view rows we be sorted like:

["list_1"], null
["list_1", "game"], {"_id":"game_1234"}
["list_1", "game"], {"_id":"game_5678"}
["list_2"], null
["list_2","game"], {"_id":"game1234"}
["list_3"], null
...

Putting this together with your existing data model, here is some (untested) pseudocode that should do the trick:

function(doc) {
    if (doc.type=="list") {
        //this is the one in the one-to-many
        emit( [doc._id]),);
    }
    else if (doc.type=="relationship") {
        //this is the many in the one-to-many
        //doc.list_id is our foreign key to the list.  We use that as the key
        //doc.game_id is the foreign key to the game.  We use that as the value
        emit( [doc.list_id,'game'],  {'_id': doc.game_id});
    }
}   

Finally, you would query that with a startkey/endkey so that you get all rows that begin with the list_id that you're interested in. It would look something like:

curl -g 'https://usr:[email protected]/db/_design/design_doc_name/_view/view_name?startkey=["123"]&endkey=["123",{}]&include_docs=true'

The -g option tells curl not to glob, which means you don't have to dereference your square brackets, etc, and the include_docs=true option will follow the pointer to the foreign key that you specified with game_id in the relationship document.

Analysis:

  1. You are using essentially immutable documents to store state changes and you let the database compute the aggregate state for you. That is a lovely model at scale, and one of our most successful patterns.
  2. Very efficient for additions or deletions to lists.
  3. Excellent scaling properties under high-concurrency
  4. At Cloudant (and CouchDB v2.0) we don't yet have 'read-your-writes consistency for secondary indexes. It is high in the priority list, but there are potential corner cases where, in failure scenarios or high-load, you may not see immediate consistency between the primary and secondary indexes. Long story short, quorum is used for primary indexes, but quorum isn't a viable model for secondary indexes, so another consistency strategy is being developed.