ElasticSearch Join Filter: Using subquery results as filter input possible?

Tobi picture Tobi · Feb 17, 2014 · Viewed 13.9k times · Source

I have a Use Case where I want to use ElasticSearch for realtime analytics. Within that, I want to be able to calculate some simple affinity scores.

Those are currently defined using the number of transactions a filtered-by-criteria user base performs, compared with the complete user base.

From my understanding, I'd need to do the following:

  1. Get the distinct transactions of my filtered user base
  2. Query for these transaction (types) in the complete user base
  3. Do the calculation (norming etc.)

To get the "distinct transactions" for the filtered user base, I currently use a Terms Filter Query with faceting which returns all terms (transaction types). As far as I understand, I's need to use this result as input of a Terms Filter Query for the second step to be able to receive the result I want.

I read that there's a pull request on GitHub which seems to implement this (https://github.com/elasticsearch/elasticsearch/pull/3278), but it's not really obvious to me whether this is already usable in a current release or not.

If not, are there some workarounds how I could implement this?

As additional info, here is my sample mapping:

curl -XPUT 'http://localhost:9200/store/user/_mapping' -d '
{
  "user": {
    "properties": {
      "user_id": { "type": "integer" },
      "gender": { "type": "string", "index" : "not_analyzed" },
      "age": { "type": "integer" },
      "age_bracket": { "type": "string", "index" : "not_analyzed" },
      "current_city": { "type": "string", "index" : "not_analyzed" },
      "relationship_status": { "type": "string", "index" : "not_analyzed" },
      "transactions" : {
        "type": "nested",
        "properties" : {
          "t_id": { "type": "integer" },
          "t_oid": { "type": "string", "index" : "not_analyzed" },
          "t_name": { "type": "string", "index" : "not_analyzed" },
          "tt_id": { "type": "integer" },
          "tt_name": { "type": "string", "index" : "not_analyzed" },
        }
      }
    }
  }
}'

So, for my actual desired result for my example Use Case, I'd have the following:

  1. My filtered user base would have this example filter: "gender": "male" & "relationship_status": "single". For these, I want to get the distinct transaction types (field "tt_name" of the nested document) and count the number of distinct user_ids.
  2. Next, I want to query my complete user base (no filter other than the list of transaction types from 1.) and count the number of distinct user_ids
  3. Do the "affinity" calculations

Answer

Ben at Qbox.io picture Ben at Qbox.io · Feb 26, 2014

Here's a link to a runnable example:

http://sense.qbox.io/gist/9da6a30fc12c36f90ae39111a08df283b56ec03c

It presumes documents that look like:

{ "transaction_type" : "some_transaction", "user_base" : "some_user_base_id" }

The query is set to return no results, since aggregations take care of computing the stats you're looking for:

{
  "size" : 0,
  "query" : {
    "match_all" : {}
  },
  "aggs" : {
    "distinct_transactions" : {
      "terms" : {
        "field" : "transaction_type",
        "size" : 20
      },
      "aggs" : {
        "by_user_base" : {
          "terms" : {
            "field" : "user_base",
            "size" : 20
          }
        }
      }
    }
  }
}

And here's what the result looks like:

  "aggregations": {
      "distinct_transactions": {
         "buckets": [
            {
               "key": "subscribe",
               "doc_count": 4,
               "by_user_base": {
                  "buckets": [
                     {
                        "key": "2",
                        "doc_count": 3
                     },
                     {
                        "key": "1",
                        "doc_count": 1
                     }
                  ]
               }
            },
            {
               "key": "purchase",
               "doc_count": 3,
               "by_user_base": {
                  "buckets": [
                     {
                        "key": "1",
                        "doc_count": 2
                     },
                     {
                        "key": "2",
                        "doc_count": 1
                     }
                  ]
               }
            }
         ]
      }
   }

So, inside of "aggregations", you'll have a list of "distinct_transactions". The key will be the transaction type, and the doc_count will represent the total transactions by all users.

Inside of each "distinct_transaction", there's "by_user_base", which is another terms agg (nested). Just like the transactions, the key will represent the user base name (or ID or whatever) and the doc_count will represent that unique user base's # of transactions.

Is that what you were looking to do? Hope I helped.