ElasticSearch returning only documents with distinct value

user962206 picture user962206 · Jul 1, 2014 · Viewed 18.2k times · Source

Let's say I have this given data

{
            "name" : "ABC",
            "favorite_cars" : [ "ferrari","toyota" ]
          }, {
            "name" : "ABC",
            "favorite_cars" : [ "ferrari","toyota" ]
          }, {
            "name" : "GEORGE",
            "favorite_cars" : [ "honda","Hyundae" ]
          }

Whenever I query this data when searching for people who's favorite car is toyota, it returns this data

{

            "name" : "ABC",
            "favorite_cars" : [ "ferrari","toyota" ]
          }, {
            "name" : "ABC",
            "favorite_cars" : [ "ferrari","toyota" ]
          }

the result is Two records of with a name of ABC. How do I select distinct documents only? The result I want to get is only this

{
                "name" : "ABC",
                "favorite_cars" : [ "ferrari","toyota" ]
              }

Here's my Query

{
    "fuzzy_like_this_field" : {
        "favorite_cars" : {
            "like_text" : "toyota",
            "max_query_terms" : 12
        }
    }
}

I am using ElasticSearch 1.0.0. with the java api client

Answer

JRL picture JRL · Jul 14, 2014

You can eliminate duplicates using aggregations. With term aggregation the results will be grouped by one field, e.g. name, also providing a count of the ocurrences of each value of the field, and will sort the results by this count (descending).

{
  "query": {
    "fuzzy_like_this_field": {
      "favorite_cars": {
        "like_text": "toyota",
        "max_query_terms": 12
      }
    }
  },
  "aggs": {
    "grouped_by_name": {
      "terms": {
        "field": "name",
        "size": 0
      }
    }
  }
}

In addition to the hits, the result will also contain the buckets with the unique values in key and with the count in doc_count:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.19178301,
    "hits" : [ {
      "_index" : "pru",
      "_type" : "pru",
      "_id" : "vGkoVV5cR8SN3lvbWzLaFQ",
      "_score" : 0.19178301,
      "_source":{"name":"ABC","favorite_cars":["ferrari","toyota"]}
    }, {
      "_index" : "pru",
      "_type" : "pru",
      "_id" : "IdEbAcI6TM6oCVxCI_3fug",
      "_score" : 0.19178301,
      "_source":{"name":"ABC","favorite_cars":["ferrari","toyota"]}
    } ]
  },
  "aggregations" : {
    "grouped_by_name" : {
      "buckets" : [ {
        "key" : "abc",
        "doc_count" : 2
      } ]
    }
  }
}

Note that using aggregations will be costly because of duplicate elimination and result sorting.