Select group-by count and distinct count in the same MongoDB query

Rams · Jul 15, 2014 · Viewed 36.6k times

I am trying to do something like this in SQL:

select campaign_id,campaign_name,count(subscriber_id),count(distinct subscriber_id)
from campaigns group by campaign_id,campaign_name;

This query gives the results I want except for count(distinct subscriber_id):

db.campaigns.aggregate([
    {$match: {subscriber_id: {$ne: null}}},
    {$group: { 
        _id: {campaign_id: "$campaign_id",campaign_name: "$campaign_name"},
        count: {$sum: 1}
    }}
])

The following query gives the results I want except for count(subscriber_id):

db.campaigns_logs.aggregate([
    {$match : {subscriber_id: {$ne: null}}},
    {$group : { _id: {campaign_id: "$campaign_id",campaign_name: "$campaign_name",subscriber_id: "$subscriber_id"}}},
    {$group : { _id: {campaign_id: "$_id.campaign_id",campaign_name: "$_id.campaign_name"}, 
                count: {$sum: 1}
              }}
])

But I want count(subscriber_id) and count(distinct subscriber_id) in the same result.

Answer

Neil Lunn · Jul 16, 2014

You are thinking along the right lines here. Setting the SQL mindset aside, "distinct" is really just another way of writing a $group operation in either language. That means you have two group operations happening here and, in aggregation pipeline terms, two pipeline stages.
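To see why "distinct" amounts to a grouping, here is a minimal plain-JavaScript sketch. It is not MongoDB code, and the array of values is a hypothetical stand-in for a "subscriber_id" column: grouping by value yields one entry per distinct value, so the number of groups is the distinct count.

```javascript
// Hypothetical subscriber_id values, not read from a real collection.
const subscriberIds = ["123", "123", "456"];

// "Group" by value: one Map entry per distinct value, holding its occurrence count.
const groups = new Map();
for (const id of subscriberIds) {
  groups.set(id, (groups.get(id) || 0) + 1);
}

const distinctValues = [...groups.keys()]; // the grouping keys are the distinct values
const distinctCount = groups.size;         // number of groups, i.e. count(distinct ...)
const totalCount = subscriberIds.length;   // number of rows, i.e. count(...)

console.log(distinctValues, distinctCount, totalCount);
```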

Take some simplified documents to visualize this:

{
    "campaign_id": "A",
    "campaign_name": "A",
    "subscriber_id": "123"
},
{
    "campaign_id": "A",
    "campaign_name": "A",
    "subscriber_id": "123"
},
{
    "campaign_id": "A",
    "campaign_name": "A",
    "subscriber_id": "456"
}

It stands to reason that for the given "campaign" combination the total count and "distinct" count are "3" and "2" respectively. So the logical thing to do is first to "group" all of those "subscriber_id" values and keep the count of occurrences for each. Then, thinking "pipeline", "total" those counts per "campaign" and count the "distinct" occurrences as a separate number:

db.campaigns.aggregate([
    { "$match": { "subscriber_id": { "$ne": null }}},

    // Count all occurrences
    { "$group": {
        "_id": {
            "campaign_id": "$campaign_id",
            "campaign_name": "$campaign_name",
            "subscriber_id": "$subscriber_id"
        },
        "count": { "$sum": 1 }
    }},

    // Sum all occurrences and count distinct
    { "$group": {
        "_id": {
            "campaign_id": "$_id.campaign_id",
            "campaign_name": "$_id.campaign_name"
        },
        "totalCount": { "$sum": "$count" },
        "distinctCount": { "$sum": 1 }
    }}
])

After the first "group" the output documents can be visualized like this:

{ 
    "_id" : { 
        "campaign_id" : "A", 
        "campaign_name" : "A", 
        "subscriber_id" : "456"
    }, 
    "count" : 1 
}
{ 
    "_id" : { 
        "campaign_id" : "A", 
        "campaign_name" : "A", 
        "subscriber_id" : "123"
    }, 
    "count" : 2
}

So of the three documents in the sample, "2" belong to one distinct value and "1" to another. These counts can still be totaled with $sum to get the total number of matching documents, which is what the following stage does, giving the final result:

{ 
    "_id" : { 
        "campaign_id" : "A", 
        "campaign_name" : "A"
    },
    "totalCount" : 3,
    "distinctCount" : 2
}
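For anyone without a MongoDB instance at hand, the two stages can be imitated in plain JavaScript on the three sample documents. This is only a rough sketch of the idea, not how the server evaluates a pipeline; `groupBy` is a hypothetical helper written for this illustration.

```javascript
// Sample documents copied from the answer above.
const docs = [
  { campaign_id: "A", campaign_name: "A", subscriber_id: "123" },
  { campaign_id: "A", campaign_name: "A", subscriber_id: "123" },
  { campaign_id: "A", campaign_name: "A", subscriber_id: "456" }
];

// Hypothetical helper: group rows by the key object returned from `keyOf`,
// folding each row into its group's accumulator with `step`.
function groupBy(rows, keyOf, init, step) {
  const groups = new Map();
  for (const row of rows) {
    const id = keyOf(row);
    const key = JSON.stringify(id);
    if (!groups.has(key)) groups.set(key, { _id: id, ...init });
    step(groups.get(key), row);
  }
  return [...groups.values()];
}

// Rough equivalent of the $match stage: drop null or missing subscriber_id.
const matched = docs.filter(d => d.subscriber_id != null);

// Stage 1: one group per (campaign, subscriber), counting occurrences.
const perSubscriber = groupBy(
  matched,
  d => ({
    campaign_id: d.campaign_id,
    campaign_name: d.campaign_name,
    subscriber_id: d.subscriber_id
  }),
  { count: 0 },
  acc => { acc.count += 1; }
);

// Stage 2: one group per campaign; sum the counts and count the groups.
const perCampaign = groupBy(
  perSubscriber,
  g => ({ campaign_id: g._id.campaign_id, campaign_name: g._id.campaign_name }),
  { totalCount: 0, distinctCount: 0 },
  (acc, g) => { acc.totalCount += g.count; acc.distinctCount += 1; }
);

console.log(JSON.stringify(perCampaign));
```

The second call never sees the original documents, only the output of the first, which is why it reads the campaign fields from `_id` just as the pipeline uses "$_id.campaign_id".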

A really good analogy for the aggregation pipeline is the Unix pipe "|" operator, which allows "chaining" of operations so you can pass the output of one command through to the input of the next, and so on. Starting to think of your processing requirements in that way will help you understand operations with the aggregation pipeline better.
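The pipe analogy can be sketched in a few lines of plain JavaScript. The helper `runPipeline` and the three stage functions are hypothetical names made up for this illustration, not part of MongoDB; the point is only that each stage receives whatever the previous stage produced.

```javascript
// Hypothetical helper: feed `docs` through each stage in order, the way a shell
// pipe feeds one command's output into the next command's input.
function runPipeline(docs, stages) {
  return stages.reduce((current, stage) => stage(current), docs);
}

// Illustrative stages; each takes an array and returns a new array.
const matchNonNull = rows => rows.filter(r => r.subscriber_id != null);
const pluckIds = rows => rows.map(r => r.subscriber_id);
const unique = ids => [...new Set(ids)];

// Hypothetical input documents.
const input = [
  { subscriber_id: "123" },
  { subscriber_id: null },
  { subscriber_id: "123" },
  { subscriber_id: "456" }
];

const result = runPipeline(input, [matchNonNull, pluckIds, unique]);
console.log(result);
```

Reordering the array of stages changes the result, just as reordering commands in a shell pipe would.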