In trying to run a database dump with a query against a database of about 5 billion documents, the progress-bar timings seem to indicate that the dump won't finish in any reasonable time (100+ days). The dump also appears to have frozen: it stopped at 0% after roughly 22 hours, and the next line of output is the metadata.json line.
The dump line is:
mongodump -h myHost -d myDatabase -c mycollection --query "{'cr' : {\$gte: new Date(1388534400000)}, \$or: [ { 'tln': { \$lte: 0., \$gte: -100.}, 'tlt': { \$lte: 100, \$gte: 0} }, { 'pln': { \$lte: 0., \$gte: -100.}, 'plt': { \$lte: 100, \$gte: 0} } ] }"
And my last few lines of output were (typed out, as I can't post images yet):
[timestamp] Collection File Writing Progress: 10214400/5066505869 0% (objects)
[timestamp] Collection File Writing Progress: 10225100/5066505869 0% (objects)
[timestamp] 10228391 objects
[timestamp] Metadata for database.collection to dump/database/collection.metadata.json
Any thoughts on how to improve performance, or any idea why this is taking so long?
I've just faced this issue, and the problem is that `mongodump` is basically not very smart. It traverses the `_id` index, which likely means lots and lots of random disk access. For me, dumping several collections, `mongodump` was simply crashing due to cursor timeouts.
The issue is also described here: https://jira.mongodb.org/browse/TOOLS-845. However, that doesn't really provide a great resolution apart from "Works as Designed". It's possible there's something funny about the index, but I think in my case it was just a large enough collection that the amount of disk access was seriously hard work for my poor little Mac Mini.
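If you want to see what's happening on your side, something like this (a sketch I haven't run against your data, reusing the host, database and date filter from your question) will show which plan the server picks for the dump's filter:

```
# Sketch: ask the server how it would execute the dump's date filter.
# A collection scan or an _id-index walk over ~5 billion documents would
# explain the glacial progress. Names are copied from the question.
mongo --quiet --host myHost myDatabase --eval '
  printjson(
    db.mycollection
      .find({ cr: { $gte: new Date(1388534400000) } })
      .explain("queryPlanner")
  )
'
```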
One solution? Shut down writes and then use `--forceTableScan`, which makes a sequential pass through the data; that might well be faster than using the `_id` index if you are using a custom `_id` field (I was).
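With the names from your question that looks roughly like this (a sketch rather than a command I ran against your data; some mongodump versions refuse to combine `--forceTableScan` with `--query`, so check the docs for your version before relying on the filter):

```
# Sketch: force a sequential scan of the collection instead of an
# _id-index traversal. Depending on your mongodump version this may not
# be combinable with --query, in which case you dump the whole collection
# and filter the date range afterwards (or at restore time).
mongodump -h myHost -d myDatabase -c mycollection --forceTableScan
```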
The docs are a bit sketchy, but it reads as if the normal `mongodump` behaviour might be to traverse the `_id` index using a snapshot and then filter by the query. In other words, it might be traversing all 5 billion records in `_id` order rather than in stored-data order, i.e. essentially at random, just to evaluate the query. So you might be better off building a tool that reads from a real index and writes the data out directly.
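As a rough sketch of what that could look like (purely illustrative: it assumes an index on `{ cr: 1 }` exists, reuses the date filter from the question, and writes plain JSON lines, which loses BSON type fidelity):

```
# Hypothetical sketch: walk an index of your choosing with the dump's date
# filter and stream the matching documents out, one JSON document per line.
mongo --quiet --host myHost myDatabase --eval '
  var cur = db.mycollection
    .find({ cr: { $gte: new Date(1388534400000) } })  // add the $or terms as needed
    .hint({ cr: 1 });                                  // force the index you actually want
  while (cur.hasNext()) {
    print(JSON.stringify(cur.next()));
  }
' > mycollection.jsonl
```

`mongoexport` with the same `--query` is another option if plain JSON output is enough, though, like `mongodump`, it leaves the choice of index to the server.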
For me, `--forceTableScan` was enough, and it meant (a) the dump actually completed successfully, and (b) it was an order of magnitude or more faster.