Assume a mobile game that is backed by a MongoDB database containing a User
collection with several million documents.
Now assume several dozen properties that must be associated with each user, e.g. an array of _id values of Friend documents, their username, photo, an array of _id values of Game documents, a last_login date, a count of in-game currency, and so on.
My concern is whether creating and updating large, growing arrays on many millions of User documents will add any 'weight' to each User document, and/or slowness to the overall system.
We will likely never exceed the 16 MB per-document limit, but we can safely say our documents will be 10-20x larger if we store these growing lists directly.
Question: is this even a problem in MongoDB? Does document size even matter if your queries are properly managed using projection, indexes, etc.? Should we be actively pruning document size, e.g. with references to external lists vs. embedding lists of _id values directly?
In other words: if I want a user's last_login value, will a query that projects/selects only the last_login field be any different if my User documents are 100 KB vs. 5 MB?
Or: if I want to find all users with a specific last_login value, will document size affect that sort of query?
First of all, you should spend a little time reading up on how MongoDB stores documents, in particular padding factors and powerOf2Sizes allocation:
http://docs.mongodb.org/manual/core/storage/
http://docs.mongodb.org/manual/reference/command/collStats/#collStats.paddingFactor
Put simply, MongoDB tries to allocate some additional space when storing your original document to allow for growth. powerOf2Sizes allocation became the default approach in version 2.6; with it, the space allocated for each document is rounded up to a power of 2.
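To make the power-of-2 idea concrete, here is a minimal sketch in Python. It is a simplified model, not MongoDB's actual allocator: the function name and the 32-byte minimum are my own assumptions, and it ignores record headers and any special handling of very large documents.

```python
def power_of_two_allocation(document_bytes: int, minimum: int = 32) -> int:
    """Round a document size up to the next power of two.

    Simplified model: `minimum` is an assumed smallest allocation, and
    record headers and large-document handling are ignored.
    """
    if document_bytes < 1:
        raise ValueError("document_bytes must be positive")
    size = minimum
    while size < document_bytes:
        size *= 2
    return size


if __name__ == "__main__":
    for doc_size in (100, 1000, 5000):
        allocated = power_of_two_allocation(doc_size)
        # The difference is the room the document has to grow in place.
        print(doc_size, "->", allocated, "spare:", allocated - doc_size)
```

Under this model a 1000-byte document gets 1024 bytes (almost no room to grow), while a 5000-byte document gets 8192 bytes, so how much headroom you get depends on where the size happens to fall.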
Overall, performance will be much better if all updates fit within the original size allocation. The reason is that if they don't, the entire document needs to be moved someplace else with enough space, causing more reads and writes and in effect fragmenting your storage.
If your documents are really going to grow in size by a factor of 10x to 20x over time, that could mean multiple moves per document, which, depending on your insert, update and read frequency, could cause issues. If that is the case, there are a couple of approaches you can consider:
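As a rough illustration of "multiple moves per document", the sketch below counts how often a steadily growing document would outgrow its allocation if every allocation were simply the next power of 2. The function names, sizes and step are made up for the example, and the model ignores padding factors and record headers.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def count_moves(initial_bytes: int, final_bytes: int, step_bytes: int) -> int:
    """Count relocations while a document grows from initial to final size.

    Simplified model: a move happens whenever the document no longer fits
    its current allocation, and the new allocation is the next power of two.
    """
    allocated = next_power_of_two(initial_bytes)
    size = initial_bytes
    moves = 0
    while size < final_bytes:
        size = min(size + step_bytes, final_bytes)
        if size > allocated:
            allocated = next_power_of_two(size)
            moves += 1
    return moves


if __name__ == "__main__":
    print(count_moves(1000, 10000, 100))  # 10x growth -> 4
    print(count_moves(1000, 20000, 100))  # 20x growth -> 5
```

So in this model a document that grows 10x to 20x is moved roughly four or five times over its lifetime. Whether that matters depends on how many documents grow and how often.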
1) Allocate enough space on initial insertion to cover most (let's say 90%) of a normal document's lifetime growth. While this will be inefficient in space usage at the beginning, efficiency will increase with time as the documents grow, without any performance reduction. In effect, you pay ahead of time for storage that you will eventually use, in exchange for good performance over time.
2) Create "overflow" documents. Let's say a typical 80-20 rule applies and 80% of your documents will fit in a certain size. Allocate for that amount and add an overflow collection that a document can point to if it has more than, for example, 100 friends or 100 Game documents. The overflow field points to a document in this new collection, and your app only looks in the new collection if the overflow field exists. This allows normal document processing for 80% of the users and avoids wasting a lot of storage on the 80% of user documents that won't need it, at the expense of additional application complexity.
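Both approaches can be sketched as plain Python that only builds the documents and update specs you would hand to a driver. Everything named here is hypothetical: the `_padding` field, the `user_overflow` collection, the limit of 100, and the choice to reuse the user's _id for the overflow document. For option 1, one way to reserve space is to insert the document with a throwaway padding field and then remove that field; treat that as an assumption about the storage behaviour described above and check it against your MongoDB version.

```python
EMBED_LIMIT = 100  # assumed threshold, from the "100 friends" example above


def padded_user(user: dict, padding_bytes: int) -> dict:
    """Option 1: copy of the user document with a throwaway padding field."""
    padded = dict(user)
    padded["_padding"] = "x" * padding_bytes  # hypothetical field name
    return padded


# Update applied right after the insert to drop the padding again.
REMOVE_PADDING = {"$unset": {"_padding": ""}}


def plan_add_friend(user: dict, friend_id) -> list:
    """Option 2: list of (collection, filter, update, upsert) steps."""
    push = {"$push": {"friends": friend_id}}
    if len(user.get("friends", [])) < EMBED_LIMIT:
        # Common case: the embedded array still has room.
        return [("users", {"_id": user["_id"]}, push, False)]
    # Assumption: the overflow document reuses the user's _id.
    overflow_id = user.get("overflow", user["_id"])
    steps = [("user_overflow", {"_id": overflow_id}, push, True)]
    if "overflow" not in user:
        # First spill: record the pointer on the user document.
        link = {"$set": {"overflow": overflow_id}}
        steps.append(("users", {"_id": user["_id"]}, link, False))
    return steps


if __name__ == "__main__":
    small = {"_id": 1, "friends": [2, 3]}
    big = {"_id": 7, "friends": list(range(100))}
    print(plan_add_friend(small, 9))
    print(plan_add_friend(big, 9))
```

The application would run each step with its driver's update call (passing the upsert flag), and on reads would only consult the overflow collection when the overflow field is present.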
In either case I'd consider using covered queries by building the appropriate indexes:
A covered query is a query in which all the fields in the query are part of an index, and all the fields returned in the results are in the same index.
Because the index “covers” the query, MongoDB can both match the query conditions and return the results using only the index; MongoDB does not need to look at the documents, only the index, to fulfill the query.
Querying only the index can be much faster than querying documents outside of the index. Index keys are typically smaller than the documents they catalog, and indexes are typically available in RAM or located sequentially on disk.
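Applied to the last_login question, the definition above can be turned into a small checker. This is only a sketch of the rule as quoted: `is_covered` is a hypothetical helper, it looks at top-level field names only, and it ignores further restrictions a real server applies (for example array fields). The one detail worth noticing is that _id is returned by default, so the projection has to exclude it unless _id is part of the index.

```python
def is_covered(index_fields, query: dict, projection: dict) -> bool:
    """Rough check of the covered-query rule for simple queries.

    Assumptions: only top-level field names are considered, and the
    projection is an inclusion projection such as {"last_login": 1, "_id": 0}.
    """
    indexed = set(index_fields)
    returned = {field for field, include in projection.items() if include}
    # _id comes back by default unless the projection turns it off.
    if projection.get("_id", 1) and "_id" not in indexed:
        return False
    if not returned:
        # Exclusion-only projections return most of the document.
        return False
    return set(query) <= indexed and returned <= indexed


if __name__ == "__main__":
    index = ["last_login"]  # e.g. a single-field index on last_login
    query = {"last_login": "2014-06-01"}
    print(is_covered(index, query, {"last_login": 1, "_id": 0}))  # True
    print(is_covered(index, query, {"last_login": 1}))            # False
    print(is_covered(index, query, {"username": 1, "_id": 0}))    # False
```

With a driver such as PyMongo, the covered case would correspond to something like `db.users.find(query, {"last_login": 1, "_id": 0})` against a collection indexed on last_login; the `users` collection name is an assumption.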
More on that approach here: http://docs.mongodb.org/manual/tutorial/create-indexes-to-support-queries/