I have a Use Case where I want to use ElasticSearch for realtime analytics. Within that, I want to be able to calculate some simple affinity scores.
Those are currently defined using the number of transactions a filtered-by-criteria user base performs, compared with the complete user base.
From my understanding, I'd need to do the following:
To get the "distinct transactions" for the filtered user base, I currently use a Terms Filter Query with faceting which returns all terms (transaction types). As far as I understand, I's need to use this result as input of a Terms Filter Query for the second step to be able to receive the result I want.
I read that there's a pull request on GitHub which seems to implement this (https://github.com/elasticsearch/elasticsearch/pull/3278), but it's not really obvious to me whether this is already usable in a current release or not.
If not, are there some workarounds how I could implement this?
As additional info, here is my sample mapping:
curl -XPUT 'http://localhost:9200/store/user/_mapping' -d '
{
"user": {
"properties": {
"user_id": { "type": "integer" },
"gender": { "type": "string", "index" : "not_analyzed" },
"age": { "type": "integer" },
"age_bracket": { "type": "string", "index" : "not_analyzed" },
"current_city": { "type": "string", "index" : "not_analyzed" },
"relationship_status": { "type": "string", "index" : "not_analyzed" },
"transactions" : {
"type": "nested",
"properties" : {
"t_id": { "type": "integer" },
"t_oid": { "type": "string", "index" : "not_analyzed" },
"t_name": { "type": "string", "index" : "not_analyzed" },
"tt_id": { "type": "integer" },
"tt_name": { "type": "string", "index" : "not_analyzed" },
}
}
}
}
}'
So, for my actual desired result for my example Use Case, I'd have the following:
Here's a link to a runnable example:
http://sense.qbox.io/gist/9da6a30fc12c36f90ae39111a08df283b56ec03c
It presumes documents that look like:
{ "transaction_type" : "some_transaction", "user_base" : "some_user_base_id" }
The query is set to return no results, since aggregations take care of computing the stats you're looking for:
{
"size" : 0,
"query" : {
"match_all" : {}
},
"aggs" : {
"distinct_transactions" : {
"terms" : {
"field" : "transaction_type",
"size" : 20
},
"aggs" : {
"by_user_base" : {
"terms" : {
"field" : "user_base",
"size" : 20
}
}
}
}
}
}
And here's what the result looks like:
"aggregations": {
"distinct_transactions": {
"buckets": [
{
"key": "subscribe",
"doc_count": 4,
"by_user_base": {
"buckets": [
{
"key": "2",
"doc_count": 3
},
{
"key": "1",
"doc_count": 1
}
]
}
},
{
"key": "purchase",
"doc_count": 3,
"by_user_base": {
"buckets": [
{
"key": "1",
"doc_count": 2
},
{
"key": "2",
"doc_count": 1
}
]
}
}
]
}
}
So, inside of "aggregations", you'll have a list of "distinct_transactions". The key will be the transaction type, and the doc_count will represent the total transactions by all users.
Inside of each "distinct_transaction", there's "by_user_base", which is another terms agg (nested). Just like the transactions, the key will represent the user base name (or ID or whatever) and the doc_count will represent that unique user base's # of transactions.
Is that what you were looking to do? Hope I helped.