Suppose I have a collection with some set of documents, something like this:
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":1, "name" : "foo"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":2, "name" : "bar"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":3, "name" : "baz"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":4, "name" : "foo"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":5, "name" : "bar"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":6, "name" : "bar"}
I want to find all the duplicated entries in this collection by the "name" field. For example, "foo" appears twice and "bar" appears three times.
The accepted answer is terribly slow on large collections, and doesn't return the _id values of the duplicate records. Aggregation is much faster and can return those _id values:
db.collection.aggregate([
  { $group: {
      _id: { name: "$name" },           // to group by another field, replace "name" here and in "$name"
      uniqueIds: { $addToSet: "$_id" }, // collect the _id of every document in the group
      count: { $sum: 1 }                // count the documents in the group
  } },
  { $match: {
      count: { $gte: 2 }                // keep only groups with more than one document, i.e. duplicates
  } },
  { $sort: { count: -1 } },             // most frequent duplicates first
  { $limit: 10 }                        // top 10 only
]);
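Note that on large collections the $group stage may hit the aggregation pipeline's 100 MB memory limit. If that happens, you can let it spill to disk with the allowDiskUse option (available since MongoDB 2.6), passed as the second argument to aggregate:

// The same pipeline as above, condensed, with allowDiskUse enabled so that
// $group can spill to disk instead of failing on very large collections.
db.collection.aggregate([
  { $group: { _id: { name: "$name" }, uniqueIds: { $addToSet: "$_id" }, count: { $sum: 1 } } },
  { $match: { count: { $gte: 2 } } },
  { $sort: { count: -1 } },
  { $limit: 10 }
], { allowDiskUse: true });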
In the first stage of the aggregation pipeline, the $group operator groups documents by the name field and, via $addToSet, stores each grouped document's _id value in the uniqueIds array. The $sum operator adds up the values of the fields passed to it, in this case the constant 1, thereby counting the number of grouped records into the count field.
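To make the intermediate shape concrete, here is roughly what the $group stage alone would emit for the sample collection in the question, one document per distinct name, before $match discards the non-duplicates:

{ "_id" : { "name" : "foo" }, "uniqueIds" : [ ObjectId("4f127fa55e7242718200002d"), ObjectId("4f127fa55e72427182000030") ], "count" : 2 }
{ "_id" : { "name" : "bar" }, "uniqueIds" : [ ObjectId("4f127fa55e7242718200002e"), ObjectId("4f127fa55e72427182000031"), ObjectId("4f127fa55e72427182000032") ], "count" : 3 }
{ "_id" : { "name" : "baz" }, "uniqueIds" : [ ObjectId("4f127fa55e7242718200002f") ], "count" : 1 }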
In the second stage of the pipeline, we use $match to filter documents with a count of at least 2, i.e. duplicates. Then we sort the most frequent duplicates first, and limit the results to the top 10.
This query will output up to $limit records with duplicate names, along with their _id values. For example:
{
  "_id" : {
    "name" : "Toothpick"
  },
  "uniqueIds" : [
    "xzuzJd2qatfJCSvkN",
    "9bpewBsKbrGBQexv4",
    "fi3Gscg9M64BQdArv"
  ],
  "count" : 3
},
{
  "_id" : {
    "name" : "Broom"
  },
  "uniqueIds" : [
    "3vwny3YEj2qBsmmhA",
    "gJeWGcuX6Wk69oFYD"
  ],
  "count" : 2
}
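If you need this for more than one field, the pipeline is easy to wrap in a small shell helper. A minimal sketch, where findDuplicates is a hypothetical name rather than anything built in:

// Hypothetical helper: report the duplicated values of an arbitrary field,
// most frequent first, together with the _ids of the offending documents.
function findDuplicates(coll, field, limit) {
  var groupId = {};
  groupId[field] = "$" + field; // e.g. { name: "$name" }
  return coll.aggregate([
    { $group: { _id: groupId, uniqueIds: { $addToSet: "$_id" }, count: { $sum: 1 } } },
    { $match: { count: { $gte: 2 } } },
    { $sort: { count: -1 } },
    { $limit: limit }
  ]).toArray();
}

// Example: the top 10 duplicated names
findDuplicates(db.collection, "name", 10);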