Why is PyMongo count_documents slower than count?

Threegirl · Sep 8, 2018 · Viewed 10.5k times

In db['TF'] I have about 60 million records.

I need to get the quantity of the records.

If I run db['TF'].count(), it returns at once.

If I run db['TF'].count_documents({}), it takes a very long time before I get the result.

However, the count method will be deprecated.

So, how can I get the quantity quickly when using count_documents? Are there some arguments I missed?

I have read the docs and the code, but found nothing.

Thanks a lot!

Answer

Amit Wagner · Sep 8, 2018

This is not about PyMongo but Mongo itself.

count is a native Mongo command. When called without a query, it doesn't really count the documents. Mongo keeps a running total in the collection's metadata, updating it whenever you insert or delete a record, and count simply returns that stored value.

count_documents takes a query filter and actually counts the matching documents (PyMongo runs it as an aggregation), so it has to go through every record that matches. Because your filter is empty, it matches all 60 million records and has to run over every one of them. This is why it is slow.
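To make the cost visible, here is a rough sketch of the kind of aggregation pipeline count_documents sends to the server. The helper name build_count_pipeline is hypothetical and only mirrors the general shape (a $match followed by a $group that sums 1 per document); the exact stages PyMongo builds may differ between versions.

```python
def build_count_pipeline(query, skip=None, limit=None):
    """Build an aggregation pipeline that counts documents matching `query`.

    Hypothetical helper for illustration; it approximates what
    count_documents does rather than reproducing PyMongo's code.
    """
    pipeline = [{"$match": query}]  # an empty query matches every document
    if skip is not None:
        pipeline.append({"$skip": skip})
    if limit is not None:
        pipeline.append({"$limit": limit})
    # Every document that reaches this stage adds 1 to the total "n".
    pipeline.append({"$group": {"_id": None, "n": {"$sum": 1}}})
    return pipeline
```

Passing the result to db['TF'].aggregate(...) with an empty query would walk the whole collection, which is the slow path described above.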

Based on @Stennie's comment:

You can use estimated_document_count() in PyMongo 3.7+ to return a fast count based on collection metadata. The original count() was deprecated because its behaviour differed (estimated vs. actual count) depending on whether query criteria were provided. The newer driver API is more intentional about the outcome.
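A minimal sketch of how the two methods could be combined. quick_count is a hypothetical helper, not part of PyMongo, and it assumes the collection argument is a pymongo.collection.Collection from PyMongo 3.7 or later.

```python
def quick_count(collection, query=None):
    """Return a document count, using the fast estimate when no filter is given.

    `collection` is assumed to be a pymongo.collection.Collection (3.7+).
    `quick_count` is a hypothetical helper, not a PyMongo function.
    """
    if not query:
        # No filter: read the count from collection metadata (fast, approximate).
        return collection.estimated_document_count()
    # With a filter: count the matching documents exactly.
    return collection.count_documents(query)
```

For the collection in the question this would be quick_count(db['TF']). The estimate can be off in some situations (for example after an unclean shutdown), so call count_documents({}) directly when the number has to be exact.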