Can't get allowDiskUse:True to work with pymongo

David Makovoz · Dec 3, 2014

I'm running into the "aggregation result exceeds maximum document size (16MB)" error with MongoDB aggregation using pymongo.

I was able to overcome it at first using the limit() option. However, at some point I got the

"Exceeded memory limit for $group, but didn't allow external sort. Pass allowDiskUse:true to opt in." error.

OK, I'll use the {'allowDiskUse':True} option. This option works when I use it on the command line, but when I tried to use it in my Python code

result = work1.aggregate(pipe, 'allowDiskUse:true')

I get a "TypeError: aggregate() takes exactly 2 arguments (3 given)" error, in spite of the definition given at http://api.mongodb.org/python/current/api/pymongo/collection.html#pymongo.collection.Collection.aggregate: aggregate(pipeline, **kwargs).

I tried to use runCommand, or rather its pymongo equivalent:

db.command('aggregate','work1',pipe, {'allowDiskUse':True})

but now I'm back to the "aggregation result exceeds maximum document size (16MB)" error.

In case you need to know, this is the pipeline:

pipe = [{'$project': {'_id': 0, 'summary.trigrams': 1}}, {'$unwind': '$summary'}, {'$unwind': '$summary.trigrams'}, {'$group': {'count': {'$sum': 1}, '_id': '$summary.trigrams'}}, {'$sort': {'count': -1}}, {'$limit': 10000}]

Thank you

Answer

Max Noel · Dec 3, 2014

So, in order:

  • aggregate is a method. It takes 2 positional arguments (self, which is implicitly passed, and pipeline) and any number of keyword arguments (which must be passed as foo=bar -- if there's no = sign, it's not a keyword argument). This means you need to call result = work1.aggregate(pipe, allowDiskUse=True).

  • Your error about maximum document size is inherent to Mongo. Mongo can never return a document (or array thereof) larger than 16 megabytes. I can't tell you why because you have given us neither your data nor your code, but it probably means that the document you're building as an end result is too large. Try decreasing the $limit parameter, maybe? Start by setting it to 1, run a test, then increase it and look at how big the result gets when you do that.
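
To make the first point concrete, here is a minimal sketch of how Python treats the two call styles. It does not talk to MongoDB: FakeCollection is a hypothetical stand-in that only copies the aggregate(pipeline, **kwargs) signature quoted in the question, and the one-stage pipe is a placeholder. With a real pymongo collection, the equivalent call is the one given above, work1.aggregate(pipe, allowDiskUse=True).

```python
class FakeCollection:
    """Hypothetical stand-in; it only mimics the aggregate(pipeline, **kwargs) signature."""

    def aggregate(self, pipeline, **kwargs):
        # Echo back what was received so the call can be inspected.
        return {"pipeline": pipeline, "options": kwargs}


work1 = FakeCollection()
pipe = [{"$limit": 1}]  # placeholder pipeline for illustration

# A bare string is a second positional argument, which the signature rejects.
try:
    work1.aggregate(pipe, "allowDiskUse:true")
    positional_error = None
except TypeError as exc:
    positional_error = exc

# The name=value form is collected into **kwargs.
result = work1.aggregate(pipe, allowDiskUse=True)

# An options dict can also be unpacked into keyword arguments with **.
options = {"allowDiskUse": True}
same_result = work1.aggregate(pipe, **options)
```

The last form may be handy if the options are already in a dict, as in the {'allowDiskUse':True} attempt from the question: unpacking with ** turns each key into a keyword argument, whereas passing the dict itself would again count as an extra positional argument.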