Fast way to find duplicates on indexed column in mongodb

Question 1

mongodb mapreduce

Piotr Czapla · Nov 19, 2010 · Viewed 26.6k times · Source

Answer

I personally found that on big databases (1 TB and more) the accepted answer is terribly slow. Aggregation is much faster. An example is below:

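db.places.aggregate([
    // Count how many documents share each extra_info.id value.
    { $group : { _id : "$extra_info.id", total : { $sum : 1 } } },
    // Keep only the values that occur at least twice.
    { $match : { total : { $gte : 2 } } },
    { $sort : { total : -1 } },
    { $limit : 5 }
]);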

It finds the extra_info.id values that are used by two or more documents, sorts them by number of occurrences in descending order, and prints the first 5.
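
The same pipeline can be applied to other fields, such as the md5 column from the question. Below is a minimal sketch of a helper that builds the stages for a given field path. The name duplicatePipeline and its parameters are made up for illustration and are not part of MongoDB. It only constructs plain objects, so it runs without a database.

```javascript
// Hypothetical helper: builds a duplicate-finding pipeline for any field path.
function duplicatePipeline(fieldPath, limit) {
    var stages = [
        // Group by the field's value and count the documents in each group.
        { $group: { _id: "$" + fieldPath, total: { $sum: 1 } } },
        // Keep only values that occur at least twice.
        { $match: { total: { $gte: 2 } } },
        // Most frequently duplicated values first.
        { $sort: { total: -1 } }
    ];
    // Cap the number of results only when a positive limit is given.
    if (limit > 0) {
        stages.push({ $limit: limit });
    }
    return stages;
}

// Usage in the mongo shell (not executed here):
// db.files.aggregate(duplicatePipeline("md5"), { allowDiskUse: true });
```

On a very large collection the $group stage may exceed the in-memory limit, so passing the allowDiskUse option to aggregate is probably worth trying.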

Question 2

I have a collection of md5 hashes in MongoDB. I'd like to find all duplicates. The md5 column is indexed. Do you know any fast way to do that using map reduce? Or should I just iterate over all records and check for duplicates manually?

My current approach using map reduce iterates over the collection almost twice (assuming that there is a very small number of duplicates):

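res = db.files.mapReduce(
    function () {
        // Emit one count per document, keyed by its md5.
        emit(this.md5, 1);
    },
    function (key, vals) {
        return Array.sum(vals);
    },
    // Newer servers require an output target; "md5_counts" is a placeholder name.
    { out: "md5_counts" }
)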

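// Keep only md5 values that occur more than once ($gte: 1 would match every row).
// `out` is presumably a handle to the database that stores the results.
db[res.result].find({value: {$gt: 1}}).forEach(
    function (obj) {
        out.duplicates.insert(obj);
    }
);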
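
The manual iteration mentioned in the question can be done in a single pass if the values arrive in sorted order, because equal md5 values then sit next to each other. The sketch below shows that idea on a plain array so it runs without a database. The function name findAdjacentDuplicates and the sample values are invented for illustration.

```javascript
// Hypothetical helper: counts repeated values in an already-sorted sequence.
function findAdjacentDuplicates(sortedValues) {
    var duplicates = [];
    var previous = null;
    var runLength = 0;
    for (var i = 0; i < sortedValues.length; i++) {
        var value = sortedValues[i];
        if (runLength > 0 && value === previous) {
            // Same value as the previous element: extend the current run.
            runLength++;
        } else {
            // A new value starts: record the finished run if it was a duplicate.
            if (runLength > 1) {
                duplicates.push({ _id: previous, total: runLength });
            }
            previous = value;
            runLength = 1;
        }
    }
    // Record the final run, which the loop itself never closes.
    if (runLength > 1) {
        duplicates.push({ _id: previous, total: runLength });
    }
    return duplicates;
}

// Made-up sample standing in for md5 values read in index order.
var sample = ["0a", "1b", "1b", "2c", "3d", "3d", "3d"];
var found = findAdjacentDuplicates(sample);
// found holds { _id: "1b", total: 2 } and { _id: "3d", total: 3 }
```

In the shell, the sorted values could come from a cursor such as db.files.find({}, { md5: 1, _id: 0 }).sort({ md5: 1 }), which should be able to use the md5 index. For a large collection the comparison would need to run inside the cursor's forEach rather than on an array loaded into memory.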
