How to Speed Up Mongodump, Dump Not Finishing

Wes · Jan 19, 2015 · Viewed 7.1k times

I'm trying to run a database dump, using a query, from a collection of about 5 billion documents. The progress output seems to indicate that this dump won't finish in any reasonable time (100+ days). The dump also appeared to freeze at 0% after roughly 22 hours - the next line of output is a metadata.json line.

The dump line is:

mongodump -h myHost -d myDatabase -c mycollection --query "{'cr' : {\$gte: new Date(1388534400000)}, \$or: [ { 'tln': { \$lte: 0., \$gte: -100.}, 'tlt': { \$lte: 100, \$gte: 0} }, { 'pln': { \$lte: 0., \$gte: -100.}, 'plt': { \$lte: 100, \$gte: 0} } ] }"

My last few lines of output were (typed out, as I can't post images yet):

[timestamp] Collection File Writing Progress: 10214400/5066505869 0% (objects)
[timestamp] Collection File Writing Progress: 10225100/5066505869 0% (objects)
[timestamp] 10228391 objects
[timestamp] Metadata for database.collection to dump/database/collection.metadata.json

Any thoughts on how to improve performance, or any idea why this is taking so long?

Answer

Stuart Watt · Jun 26, 2017

I've just faced this issue, and the problem is that mongodump is basically not very smart. It traverses the _id index, which likely means a lot of random disk access. In my case, dumping several collections, mongodump was simply crashing due to cursor timeouts.

The issue is also described here: https://jira.mongodb.org/browse/TOOLS-845. However, that doesn't really provide a great resolution apart from "Works as Designed". It's possible there's something funny about the index, but I think in my case it was just a large enough collection that the amount of disk access was seriously hard work for my poor little Mac Mini.

One solution? Stop writes to the database and then use --forceTableScan, which makes a sequential pass through the data; that may well be faster than using the _id index, particularly if you are using a custom _id field (I was).
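For illustration only (reusing the host and collection names from the question, so treat them as placeholders), the invocation would look something like the line below. Note that the mongodump documentation says --forceTableScan cannot be combined with --query, so a forced table scan dumps the whole collection and you filter afterwards:

mongodump -h myHost -d myDatabase -c mycollection --forceTableScan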

The docs are a bit sketchy, but it reads as if the normal mongodump behaviour might be to traverse the _id index using a snapshot and then filter by the query. In other words, it might be traversing all 5 billion records in _id order, not stored data order, i.e., randomly, to complete the query. So you might be better off building a tool that reads from a real index and writes directly.
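As a rough sketch of that idea (my illustration, not part of the original answer: it assumes pymongo, an index on cr, the host and collection names from the question, and only the cr part of the original filter; it writes JSON lines rather than BSON):

# Hypothetical exporter: read via an assumed index on "cr" instead of the
# _id index, streaming matching documents straight to a JSON-lines file.
from datetime import datetime, timezone
from pymongo import MongoClient
from bson.json_util import dumps

client = MongoClient("mongodb://myHost:27017/")
coll = client["myDatabase"]["mycollection"]

# Same cutoff as the question's new Date(1388534400000), i.e. 2014-01-01 UTC.
query = {"cr": {"$gte": datetime(2014, 1, 1, tzinfo=timezone.utc)}}

# hint() forces the assumed "cr" index; no_cursor_timeout avoids the
# server-side cursor timeout the answer mentions.
cursor = coll.find(query, no_cursor_timeout=True).hint([("cr", 1)])

with open("mycollection.jsonl", "w") as out:
    try:
        for doc in cursor:
            out.write(dumps(doc) + "\n")
    finally:
        cursor.close()  # cursors opened with no_cursor_timeout must be closed explicitly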

For me, --forceTableScan was enough, and it meant that (a) the dump actually completed successfully, and (b) it was an order of magnitude or more faster.