ElasticSearch's Fuzzy Query

A_dit_rien picture A_dit_rien · Apr 25, 2012 · Viewed 19.3k times · Source

I am brand new to ElasticSearch, and am currently exploring its features. One of them I am interested in is the Fuzzy Query, which I am testing and having troubles to use. It is probably a dummy question so I guess someone who already used this feature will quickly find the answer, at least I hope. :)

BTW I have the feeling that it might not be only related to ElasticSearch but maybe directly to Lucene.

Let's start with a new index named "first index" in which I store an object "label" with value "american football". This is the query I use.

bash-3.2$ curl -XPOST 'http://localhost:9200/firstindex/node/?pretty=true' -d '{
  "node" : {
    "label" : "american football"
  }
}
'

This is the result I get.

{
  "ok" : true,
  "_index" : "firstindex",
  "_type" : "node",
  "_id" : "6TXNrLSESYepXPpFWjpl1A",
  "_version" : 1
}

So far so good, now I want to find this entry using a fuzzy query. This is the one I send:

bash-3.2$ curl -XGET 'http://localhost:9200/firstindex/node/_search?pretty=true' -d '{
  "query" : {
    "fuzzy" : {
      "label" : {
        "value" : "american football",
        "boost" : 1.0,
        "min_similarity" : 0.0,
        "prefix_length" : 0
      }                       
    }    
   }   
}
'

And this is the result I get

{
  "took" : 15,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

As you can see, no hit. But now, when I shrink a bit my query's value from "american football" to "american footb" like this:

bash-3.2$ curl -XGET 'http://localhost:9200/firstindex/node/_search?pretty=true' -d ' {
  "query" : {
    "fuzzy" : {
      "label" : {
        "value" : "american footb",
        "boost" : 1.0,
        "min_similarity" : 0.0,
        "prefix_length" : 0
      }
    }
  }
}
'

Then I get a correct hit on my entry, thus the result is:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.19178301,
    "hits" : [ {
      "_index" : "firstindex",
      "_type" : "node",
      "_id" : "6TXNrLSESYepXPpFWjpl1A",
      "_score" : 0.19178301, "_source" : {
        "node" : {
          "label" : "american football"
        }
      }
    } ]
  }
}

So, I have several questions related to this test:

  1. Why I didn't get any result when performing a query with a value completely equals the my only entry "american football"

  2. Is it related to the fact that I have a multi-words value?

  3. Is there a way to get the "similarity" score in my query result so I can understand better how to find the right threshold for my fuzzy queries

  4. There is a page dedicated to Fuzzy Query on ElasticSearch web site, but I am not sure it lists all the potential parameters I can use for the fuzzy query. Were could I find such an exhaustive list?

  5. Same question for the other queries actually.

  6. is there a difference between a Fuzzy Query and a Query String Query using lucene syntax to get fuzzy matching?

Answer

imotov picture imotov · Apr 28, 2012

1.

The fuzzy query operates on terms. It cannot handle phrases because it doesn't analyze the text. So, in your example, elasticsearch tries to match the term "american football" to the term american and to the term football. The match between terms is based on Levenshtein distance, which is used to calculate similarity score. Since you have min_similarity=0.0 any term should match any term as long as edit distance is smaller than the size of the smallest term. In your case, the term "american football" has size 17 and the term "american" has size 8. The distance between these two terms is 9 which is bigger than the size of the smallest term 8. So, as a result, this term is getting rejected. The edit distance between "american footb" and "american" is 6. It's basically the term "american" with 6 additions at the end. That's why it produces results. With min_similarity=0.0 pretty much anything with edit distance 7 or less will match. You will even get results while searching for "aqqqqqq", for example.

2.

Yes, as I explained above, it is somewhat related to multi-word values. If you want to search for multiple terms, take a look at Fuzzy Like This Query and fuzziness parameter of Text Query

4 & 5.

Usually, the next best source of information after elasticsearch.org is elasticsearch source code.