In short: If you have a large number of documents with varying sizes, where relatively few documents hit the maximum object size, what are the best practices to store those documents in MongoDB?
I have a set of documents like:
{_id: ...,
values: [12, 13, 434, 5555 ...]
}
The length of the values list varies hugely from one document to another. For the majority of documents, it will have a few elements, for a few it will have tens of millions of elements, and I will hit the maximum object size limit in MongoDB. The trouble is any special solution I come up with for those very large (and relatively few) documents might have an impact on how I store the small documents which would, otherwise, live happily in a MongoDB collection.
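To put rough numbers on when the limit is hit, here is a minimal sketch that estimates the encoded size of a BSON array of integers. It assumes the 16 MB document limit of current MongoDB versions (older releases had a smaller one) and 4-byte int32 values; the function name and constant are made up for illustration. In BSON, each array element costs one type byte, its index stored as a null-terminated decimal string, and the value itself.

```python
# Assumed limit: 16 MB per BSON document (current MongoDB versions).
MAX_BSON_BYTES = 16 * 1024 * 1024


def bson_int_array_size(n, value_bytes=4):
    """Estimate the encoded size in bytes of a BSON array of n integers.

    value_bytes is 4 for int32 values and 8 for int64 values.
    """
    key_bytes = 0
    digits = 1
    start = 0  # first index whose decimal form has `digits` digits
    while start < n:
        end = min(n, 10 ** digits)
        # Each key is the index as a string plus a terminating NUL byte.
        key_bytes += (end - start) * (digits + 1)
        start = end
        digits += 1
    # 4-byte length prefix + (type byte + value) per element + keys + terminator
    return 4 + n * (1 + value_bytes) + key_bytes + 1


if __name__ == "__main__":
    for n in (10, 1_000_000, 10_000_000):
        size = bson_int_array_size(n)
        print(n, size, "over limit" if size > MAX_BSON_BYTES else "fits")
```

Under these assumptions the array keys alone cost more than the values for long lists, so a list with tens of millions of elements is far beyond what a single document can hold.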
As far as I see, I have the following options. I would appreciate any input on pros and cons of those, and any other option that I missed.
1) Use another datastore: That seems too drastic. I like MongoDB, and it's not like I hit the size limit for many objects. In the worst case, my application could treat the very large objects and the rest differently. It just doesn't seem elegant.
2) Use GridFS to store the values: Like a blob in a traditional DB, I could keep the first few thousand elements of values in the document, and if there are more elements in the list, I could keep the rest in a GridFS object as a binary file. I wouldn't be able to search in this part, but I can live with that.
3) Abuse GridFS: I could keep every document in GridFS. For the majority of the (small) documents the binary chunk would be empty because the files collection would be able to keep everything. For the rest I could keep the excess elements in the chunks collection. Does that introduce an overhead compared to option #2?
4) Really abuse GridFS: I could use the optional fields in the files collection of GridFS to store all elements in the values. Does GridFS do smart chunking also for the files collection?
5) Use an additional "relational" collection to store the one-to-many relation, but the number of documents in this collection would easily exceed a hundred billion rows.
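As an illustration of option 2, the split between the searchable head and the binary overflow could look roughly like the sketch below. The names `split_values`, `unpack_tail`, and `HEAD_LIMIT` are hypothetical, and the values are assumed to fit in signed 64-bit integers. The code only prepares the data; storing the blob is left out so the sketch runs without a server. With PyMongo, the blob could then be written with something like `gridfs.GridFS(db).put(blob)` and the returned file id kept in the document next to the head.

```python
from array import array

# Hypothetical cutoff: how many elements stay inside the document itself.
HEAD_LIMIT = 5000


def split_values(values, head_limit=HEAD_LIMIT):
    """Split a list of integers into an in-document head and a binary tail.

    The head stays queryable as a normal array field. The tail is packed
    as signed 64-bit integers in the machine's native byte order.
    """
    head = list(values[:head_limit])
    tail = values[head_limit:]
    blob = array("q", tail).tobytes()
    return head, blob


def unpack_tail(blob):
    """Restore the overflow elements from the binary blob."""
    tail = array("q")
    tail.frombytes(blob)
    return tail.tolist()


if __name__ == "__main__":
    head, blob = split_values(list(range(12)), head_limit=5)
    print(head, len(blob), unpack_tail(blob))
```

Small documents produce an empty blob and need no GridFS file at all, so only the few oversized documents pay the extra read. Native byte order is an assumption that readers and writers run on the same architecture; a fixed byte order would be safer across machines.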
If you have large documents, try to store some metadata about them in MongoDB, and put the rest of the data (the part you will not be querying on) outside.