How to perform an exact match query on an analyzed field in Elasticsearch?

Zobayer Hasan · Jan 19, 2016

This is probably a very commonly asked question; however, the answers I've gotten so far aren't satisfactory.

Problem: I have an Elasticsearch index composed of nearly 100 fields. Most of the fields are string type and set as analyzed. However, a query can be either partial (match) or exact (more like term). So, if my index contains a string field with the value super duper cool pizza, a partial query like duper super should match the document, while an exact query like cool pizza should not match it. On the other hand, Super Duper COOL PIzza should again match the document.

So far, the partial match part has been easy: I used the AND operator in a match query. However, I can't get the other type to work.
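To make the partial case concrete, here is a minimal sketch of the match-with-AND query I mean (the index and field names are placeholders):

POST my_index/_search
{
  "query": {
    "match": {
      "text_field": {
        "query": "duper super",
        "operator": "and"
      }
    }
  }
}

With the operator set to and, every analyzed token of the query text must be present in the field, so duper super matches a document containing super duper cool pizza.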

I have looked into other posts related to this problem, and this one contains the closest solution: Elasticsearch exact matches on analyzed fields

Out of the three solutions there, the first one feels very complex, as I have a lot of fields and I do not use the REST API; I create queries dynamically using QueryBuilders with NativeSearchQueryBuilder from the Java API. It also generates a lot of possible patterns, which I think will cause performance issues.

The second one is a much easier solution, but again I would have to maintain a lot of (almost) redundant data, and I don't think term queries alone are ever going to solve my problem.

The last one has a problem, I think: it will not prevent super duper from matching super duper cool pizza, which is not the output I want.

So is there any other way I can achieve the goal? I can post some sample mappings if required to clarify the question further. I am already keeping the source as well (in case that can be used). Please feel free to suggest any improvements.

Thanks in advance.

[UPDATE]

Finally, I used multi_field, keeping a raw sub-field for exact queries. At insert time I apply some custom modifications to the data, and during searching I apply the same modification routines to the input text. This part is not handled by Elasticsearch; if you want that handled as well, you have to design appropriate analyzers.

Index settings and mapping queries (the index has to be closed while the analysis settings are updated, then reopened):

PUT test_index

POST test_index/_close

PUT test_index/_settings
{
  "index": {
    "analysis": {
      "analyzer": {
        "standard_uppercase": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "keyword",
          "filter": ["uppercase"]
        }
      }
    }
  }
}

PUT test_index/doc/_mapping
{
  "doc": {
     "properties": {
        "text_field": {
           "type": "string",
           "fields": {
              "raw": {
                 "type": "string",
                 "analyzer": "standard_uppercase"
              }
           }
        }
     }
  }
}

POST test_index/_open
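
Before inserting data, the analyzer can be sanity-checked with the _analyze API (the request-body form below is ES 2.x syntax; on 1.x the analyzer and text are passed as query-string parameters instead):

POST test_index/_analyze
{
  "analyzer": "standard_uppercase",
  "text": "Super Duper COOL PIzza"
}

This should return a single token, SUPER DUPER COOL PIZZA: the keyword tokenizer keeps the whole input as one token and the uppercase filter normalizes its case.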

Inserting some sample data:

POST test_index/doc/_bulk
{"index":{"_id":1}}
{"text_field":"super duper cool pizza"}
{"index":{"_id":2}}
{"text_field":"some other text"}
{"index":{"_id":3}}
{"text_field":"pizza"}

Exact query:

GET test_index/doc/_search
{
  "query": {
    "bool": {
      "must": {
        "bool": {
          "should": {
            "term": {
             "text_field.raw": "PIZZA"
            }
          }
        }
      }
    }
  }
}

Response:

{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 1.4054651,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "3",
            "_score": 1.4054651,
            "_source": {
               "text_field": "pizza"
            }
         }
      ]
   }
}
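
Note the uppercase PIZZA in the term query: term queries bypass analysis, so the query text must already match the stored (uppercased) token exactly. To accept mixed-case input instead, a match query on the same sub-field works, since it runs the query string through the field's analyzer first, roughly like this:

GET test_index/doc/_search
{
  "query": {
    "match": {
      "text_field.raw": "Super Duper COOL PIzza"
    }
  }
}

This matches document 1, because the query string is uppercased into the same single token that was indexed.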

Partial query:

GET test_index/doc/_search
{
  "query": {
    "bool": {
      "must": {
        "bool": {
          "should": {
            "match": {
              "text_field": {
                "query": "pizza",
                "operator": "AND",
                "type": "boolean"
              }
            }
          }
        }
      }
    }
  }
}

Response:

{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 1,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "3",
            "_score": 1,
            "_source": {
               "text_field": "pizza"
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 0.5,
            "_source": {
               "text_field": "super duper cool pizza"
            }
         }
      ]
   }
}

PS: These are generated queries, which is why there are some redundant bool blocks; many other fields would be concatenated into the queries.

The sad part is that now I need to rewrite the whole mapping again :(

Answer

Sloan Ahrens · Jan 20, 2016

I think this will do what you want (or at least come as close as possible), using the keyword tokenizer and lowercase token filter:

PUT /test_index
{
   "settings": {
      "analysis": {
         "analyzer": {
            "lowercase_analyzer": {
               "type": "custom",
               "tokenizer": "keyword",
               "filter": ["lowercase_token_filter"]
            }
         },
         "filter": {
            "lowercase_token_filter": {
               "type": "lowercase"
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "properties": {
            "text_field": {
               "type": "string",
               "fields": {
                  "raw": {
                     "type": "string",
                     "index": "not_analyzed"
                  },
                  "lowercase": {
                     "type": "string",
                     "analyzer": "lowercase_analyzer"
                  }
               }
            }
         }
      }
   }
}
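
As a quick check of what lowercase_analyzer emits, the _analyze API can be used here too (request-body syntax, as supported on ES 2.x):

POST /test_index/_analyze
{
  "analyzer": "lowercase_analyzer",
  "text": "Super Duper COOL PIzza"
}

This should come back as the single token super duper cool pizza.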

I added a couple of docs for testing:

POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"text_field":"super duper cool pizza"}
{"index":{"_id":2}}
{"text_field":"some other text"}
{"index":{"_id":3}}
{"text_field":"pizza"}

Notice that the outer text_field is analyzed by the standard analyzer; then there is a sub-field raw that's not_analyzed (you may not want this one, I just added it for comparison), and another sub-field lowercase whose tokens are exactly the same as the input text, except that they have been lowercased (but not split on whitespace). So this match query returns what you expected:

POST /test_index/_search
{
    "query": {
        "match": {
           "text_field.lowercase": "Super Duper COOL PIzza"
        }
    }
}
...
{
   "took": 3,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.30685282,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 0.30685282,
            "_source": {
               "text_field": "super duper cool pizza"
            }
         }
      ]
   }
}

Remember that the match query will use the field's analyzer against the search phrase as well, so in this case searching for "super duper cool pizza" would have exactly the same effect as searching for "Super Duper COOL PIzza" (you could still use a term query if you want an exact match).
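
For example, a term query against text_field.lowercase has to supply the already-lowercased token, since term queries skip analysis entirely:

POST /test_index/_search
{
    "query": {
        "term": {
           "text_field.lowercase": "super duper cool pizza"
        }
    }
}

This matches document 1 and nothing else, whereas the same term query with "Super Duper COOL PIzza" would return no hits.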

It's useful to take a look at the terms generated in each field by the three documents, since this is what your search queries will be working against (in this case raw and lowercase have the same tokens, but that's only because all the inputs were lower-case already):

POST /test_index/_search
{
   "size": 0,
   "aggs": {
      "text_field_standard": {
         "terms": {
            "field": "text_field"
         }
      },
      "text_field_raw": {
         "terms": {
            "field": "text_field.raw"
         }
      },
      "text_field_lowercase": {
         "terms": {
            "field": "text_field.lowercase"
         }
      }
   }
}
...
{
   "took": 26,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 3,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "text_field_raw": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "pizza",
               "doc_count": 1
            },
            {
               "key": "some other text",
               "doc_count": 1
            },
            {
               "key": "super duper cool pizza",
               "doc_count": 1
            }
         ]
      },
      "text_field_lowercase": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "pizza",
               "doc_count": 1
            },
            {
               "key": "some other text",
               "doc_count": 1
            },
            {
               "key": "super duper cool pizza",
               "doc_count": 1
            }
         ]
      },
      "text_field_standard": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "pizza",
               "doc_count": 2
            },
            {
               "key": "cool",
               "doc_count": 1
            },
            {
               "key": "duper",
               "doc_count": 1
            },
            {
               "key": "other",
               "doc_count": 1
            },
            {
               "key": "some",
               "doc_count": 1
            },
            {
               "key": "super",
               "doc_count": 1
            },
            {
               "key": "text",
               "doc_count": 1
            }
         ]
      }
   }
}

Here's the code I used to test this out:

http://sense.qbox.io/gist/cc7564464cec88dd7f9e6d9d7cfccca2f564fde1

If you also want to do partial word matching, I would encourage you to take a look at ngrams. I wrote up an introduction for Qbox here:

https://qbox.io/blog/an-introduction-to-ngrams-in-elasticsearch
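
As a very rough sketch of the ngram idea (the index name, field, and gram sizes here are just illustrative choices):

PUT /ngram_test
{
   "settings": {
      "analysis": {
         "filter": {
            "ngram_filter": {
               "type": "nGram",
               "min_gram": 3,
               "max_gram": 4
            }
         },
         "analyzer": {
            "ngram_analyzer": {
               "type": "custom",
               "tokenizer": "standard",
               "filter": ["lowercase", "ngram_filter"]
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "properties": {
            "text_field": {
               "type": "string",
               "analyzer": "ngram_analyzer"
            }
         }
      }
   }
}

With this, pizza is indexed as grams like piz, izz, zza, pizz, and izza, so a query containing only part of a word can still match; the blog post covers the details and tradeoffs.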