120 mongodb collections vs single collection - which one is more efficient?

user2297996 · Apr 19, 2013 · Viewed 12.1k times

I'm new to MongoDB and I'm facing a dilemma regarding my DB schema design:

Should I create one single collection, or split my data into several collections (which we could call categories, I suppose)?

Now I know many such questions have been asked, but I believe my case is different for 2 reasons:

  • If I go for many collections, I'll have to create about 120 and that's it. This won't grow in the future.
  • I know I'll never need to query or insert into multiple collections. I will always have to query only one, since a document in collection X is not related to any document stored in the other collections. Documents may hold references to other parts of the DB though (like userId etc).

So my question is: could the 120 collections improve query performance? Is this a useful optimization in my case?

Or should I just go for a single collection + sharding?

Each collection is expected to hold millions of documents. If I use only one, it will store billions of docs.

Thanks in advance!

------- Edit:

Thanks for the great answers.

In fact, the 120-collection figure is only a self-imposed limit, and it's not really optimal:

The data in the collections is related to web publishers. There could be millions of these (any web site can join).

I guess the ideal situation would be if I could create a collection for each publisher (to hold their data only). But obviously this is not possible, due to MongoDB's limits on the number of namespaces.

So I came up with the idea of a fixed number of collections, to at least distribute the data somehow. For example: collection "A_XX" would hold XX-platform data for publishers whose names start with "A", etc. We'll only support a few of these platforms, so 120 collections should be more than enough.
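The bucketing rule described above can be sketched as a tiny helper. This is only an illustration of the scheme; the function name and the exact normalization (trimming, upper-casing the first character) are my own assumptions, not anything MongoDB requires:

```javascript
// Hypothetical helper: map a publisher name and a platform code to the
// fixed "A_XX"-style collection name described above.
function collectionName(publisher, platform) {
  // Bucket by the first letter of the publisher's name, upper-cased.
  const letter = publisher.trim().charAt(0).toUpperCase();
  return letter + "_" + platform;
}

console.log(collectionName("acme.com", "XX")); // → "A_XX"
```

With 26 letters and a handful of platforms, a scheme like this naturally stays within the self-imposed 120-collection budget.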

On another website someone suggested using many databases instead of many collections. But this means overhead and then I would have to use / manage many different connections.

What do you think about this? Is there a better solution?

Sorry for not being specific enough in my original question.

Thanks in advance

Answer

Stennie · Apr 19, 2013

Single Sharded Collection

The edited version of the question makes the actual requirement clearer: you have a collection that can potentially grow very large and you want an approach to partition the data. The artificial collection limit is your own planned partitioning scheme.

In that case, I think you would be best off using a single collection and taking advantage of MongoDB's auto-sharding feature to distribute the data and workload across multiple servers as required. Using multiple collections is still a valid approach, but it unnecessarily complicates your application code and deployment versus leveraging core MongoDB features. Assuming you choose a good shard key, your data will be balanced across your shards automatically.
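Enabling sharding then comes down to a couple of admin commands run through a mongos router. A minimal sketch in 2.4-era mongo shell syntax, assuming a hypothetical `pubdata` database, an `events` collection, and a `publisherId` field as the leading shard key component (none of these names are from the question):

```javascript
// Run via mongos. All names here are hypothetical.
sh.enableSharding("pubdata")

// The shard key needs a supporting index; a compound key keeps each
// publisher's documents together while spreading publishers across shards.
db.events.ensureIndex({ publisherId: 1, _id: 1 })
sh.shardCollection("pubdata.events", { publisherId: 1, _id: 1 })
```

Shard key choice is the critical decision here: a low-cardinality or monotonically increasing key can concentrate writes on a single shard.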

You do not have to shard immediately; you can defer the decision until you see your workload actually requiring more write scale (knowing the option is there when you need it). You also have other options before deciding to shard, such as upgrading your servers (disks and memory in particular) to better support your workload. Conversely, you don't want to wait until your system is crushed by its workload before sharding, so you definitely need to monitor growth. I would suggest using the free MongoDB Monitoring Service (MMS) provided by 10gen.

On another website someone suggested using many databases instead of many collections. But this means overhead and then I would have to use / manage many different connections.

Multiple databases will add significantly more administrative overhead, and would likely be overkill and possibly detrimental for your use case. Storage is allocated at the database level, so 120 databases would be consuming much more space than a single database with 120 collections.

Fixed number of collections (original answer)

If you can plan for a fixed number of collections (120 as per your original question description), I think it makes more sense to take this approach rather than using a monolithic collection.

NOTE: the design considerations below still apply, but since the question was updated to clarify that multiple collections are an attempted partitioning scheme, sharding a single collection would be a much more straightforward approach.

The motivations for using separate collections would be:

  • Your documents for a single large collection will likely have to include some indication of the collection subtype, which may need to be added to multiple indexes and could significantly increase index sizes. With separate collections the subtype is already implicit in the collection namespace.

  • Sharding is enabled at the collection level. A single large collection only gives you an "all or nothing" approach, whereas individual collections allow you to control which subset(s) of data need to be sharded and choose more appropriate shard keys.

  • You can use the compact command to defragment individual collections. Note: compact is a blocking operation, so the normal recommendation for an HA production environment would be to deploy a replica set and use rolling maintenance (i.e. compact the secondaries first, then step down and compact the primary).

  • MongoDB 2.4 (and 2.2) currently have database-level write lock granularity. In practice this has not proven to be a problem for the vast majority of use cases; however, multiple collections would allow you to more easily move high-activity collections into separate databases if needed.

  • Further to the previous point: if your data is in separate collections, it will be able to take advantage of future improvements in collection-level locking (see SERVER-1240 in the MongoDB Jira issue tracker).
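To make the first bullet's index-size point concrete, here is a sketch in mongo shell syntax; the collection and field names (`data`, `A_XX`, `platform`, `publisher`) are hypothetical:

```javascript
// Single large collection: every secondary index must carry the
// subtype discriminator, inflating index size.
db.data.ensureIndex({ platform: 1, publisher: 1 })
db.data.find({ platform: "XX", publisher: "acme" })

// Per-subtype collections: the subtype lives in the namespace itself,
// so each index is one field smaller.
db.A_XX.ensureIndex({ publisher: 1 })
db.A_XX.find({ publisher: "acme" })
```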