I need to autocomplete phrases. For example, when I search "dementia in alz", I want to get "dementia in alzheimer's".
For this, I configured Edge NGram tokenizer. I tried both edge_ngram_analyzer
and standard
as the analyzer in the query body. Nevertheless, I can't get results when I'm trying to match a phrase.
What am I doing wrong?
My query:
{
"query":{
"multi_match":{
"query":"dementia in alz",
"type":"phrase",
"analyzer":"edge_ngram_analyzer",
"fields":["_all"]
}
}
}
My mappings:
...
"type" : {
"_all" : {
"analyzer" : "edge_ngram_analyzer",
"search_analyzer" : "standard"
},
"properties" : {
"field" : {
"type" : "string",
"analyzer" : "edge_ngram_analyzer",
"search_analyzer" : "standard"
},
...
"settings" : {
...
"analysis" : {
"filter" : {
"stem_possessive_filter" : {
"name" : "possessive_english",
"type" : "stemmer"
}
},
"analyzer" : {
"edge_ngram_analyzer" : {
"filter" : [ "lowercase" ],
"tokenizer" : "edge_ngram_tokenizer"
}
},
"tokenizer" : {
"edge_ngram_tokenizer" : {
"token_chars" : [ "letter", "digit", "whitespace" ],
"min_gram" : "2",
"type" : "edgeNGram",
"max_gram" : "25"
}
}
}
...
My documents:
{
"_score": 1.1152233,
"_type": "Diagnosis",
"_id": "AVZLfHfBE5CzEm8aJ3Xp",
"_source": {
"@timestamp": "2016-08-02T13:40:48.665Z",
"type": "Diagnosis",
"Document_ID": "Diagnosis_1400541",
"Diagnosis": "F00.0 - Dementia in Alzheimer's disease with early onset",
"@version": "1",
},
"_index": "carenotes"
},
{
"_score": 1.1152233,
"_type": "Diagnosis",
"_id": "AVZLfICrE5CzEm8aJ4Dc",
"_source": {
"@timestamp": "2016-08-02T13:40:51.240Z",
"type": "Diagnosis",
"Document_ID": "Diagnosis_1424351",
"Diagnosis": "F00.1 - Dementia in Alzheimer's disease with late onset",
"@version": "1",
},
"_index": "carenotes"
}
Analysis of the "dementia in alzheimer" phrase:
{
"tokens": [
{
"end_offset": 2,
"token": "de",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 3,
"token": "dem",
"type": "word",
"start_offset": 0,
"position": 1
},
{
"end_offset": 4,
"token": "deme",
"type": "word",
"start_offset": 0,
"position": 2
},
{
"end_offset": 5,
"token": "demen",
"type": "word",
"start_offset": 0,
"position": 3
},
{
"end_offset": 6,
"token": "dement",
"type": "word",
"start_offset": 0,
"position": 4
},
{
"end_offset": 7,
"token": "dementi",
"type": "word",
"start_offset": 0,
"position": 5
},
{
"end_offset": 8,
"token": "dementia",
"type": "word",
"start_offset": 0,
"position": 6
},
{
"end_offset": 9,
"token": "dementia ",
"type": "word",
"start_offset": 0,
"position": 7
},
{
"end_offset": 10,
"token": "dementia i",
"type": "word",
"start_offset": 0,
"position": 8
},
{
"end_offset": 11,
"token": "dementia in",
"type": "word",
"start_offset": 0,
"position": 9
},
{
"end_offset": 12,
"token": "dementia in ",
"type": "word",
"start_offset": 0,
"position": 10
},
{
"end_offset": 13,
"token": "dementia in a",
"type": "word",
"start_offset": 0,
"position": 11
},
{
"end_offset": 14,
"token": "dementia in al",
"type": "word",
"start_offset": 0,
"position": 12
},
{
"end_offset": 15,
"token": "dementia in alz",
"type": "word",
"start_offset": 0,
"position": 13
},
{
"end_offset": 16,
"token": "dementia in alzh",
"type": "word",
"start_offset": 0,
"position": 14
},
{
"end_offset": 17,
"token": "dementia in alzhe",
"type": "word",
"start_offset": 0,
"position": 15
},
{
"end_offset": 18,
"token": "dementia in alzhei",
"type": "word",
"start_offset": 0,
"position": 16
},
{
"end_offset": 19,
"token": "dementia in alzheim",
"type": "word",
"start_offset": 0,
"position": 17
},
{
"end_offset": 20,
"token": "dementia in alzheime",
"type": "word",
"start_offset": 0,
"position": 18
},
{
"end_offset": 21,
"token": "dementia in alzheimer",
"type": "word",
"start_offset": 0,
"position": 19
}
]
}
Many thanks to rendel who helped me to find the right solution!
The solution of Andrei Stefan is not optimal.
Why? First, the absence of the lowercase filter in the search analyzer makes search inconvenient; the case must be matched strictly. A custom analyzer with lowercase
filter is needed instead of "analyzer": "keyword"
.
Second, the analysis part is wrong!
During index time a string "F00.0 - Dementia in Alzheimer's disease with early onset" is analyzed by edge_ngram_analyzer
. With this analyzer, we have the following array of dictionaries as the analyzed string:
{
"tokens": [
{
"end_offset": 2,
"token": "f0",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 3,
"token": "f00",
"type": "word",
"start_offset": 0,
"position": 1
},
{
"end_offset": 6,
"token": "0 ",
"type": "word",
"start_offset": 4,
"position": 2
},
{
"end_offset": 9,
"token": " ",
"type": "word",
"start_offset": 7,
"position": 3
},
{
"end_offset": 10,
"token": " d",
"type": "word",
"start_offset": 7,
"position": 4
},
{
"end_offset": 11,
"token": " de",
"type": "word",
"start_offset": 7,
"position": 5
},
{
"end_offset": 12,
"token": " dem",
"type": "word",
"start_offset": 7,
"position": 6
},
{
"end_offset": 13,
"token": " deme",
"type": "word",
"start_offset": 7,
"position": 7
},
{
"end_offset": 14,
"token": " demen",
"type": "word",
"start_offset": 7,
"position": 8
},
{
"end_offset": 15,
"token": " dement",
"type": "word",
"start_offset": 7,
"position": 9
},
{
"end_offset": 16,
"token": " dementi",
"type": "word",
"start_offset": 7,
"position": 10
},
{
"end_offset": 17,
"token": " dementia",
"type": "word",
"start_offset": 7,
"position": 11
},
{
"end_offset": 18,
"token": " dementia ",
"type": "word",
"start_offset": 7,
"position": 12
},
{
"end_offset": 19,
"token": " dementia i",
"type": "word",
"start_offset": 7,
"position": 13
},
{
"end_offset": 20,
"token": " dementia in",
"type": "word",
"start_offset": 7,
"position": 14
},
{
"end_offset": 21,
"token": " dementia in ",
"type": "word",
"start_offset": 7,
"position": 15
},
{
"end_offset": 22,
"token": " dementia in a",
"type": "word",
"start_offset": 7,
"position": 16
},
{
"end_offset": 23,
"token": " dementia in al",
"type": "word",
"start_offset": 7,
"position": 17
},
{
"end_offset": 24,
"token": " dementia in alz",
"type": "word",
"start_offset": 7,
"position": 18
},
{
"end_offset": 25,
"token": " dementia in alzh",
"type": "word",
"start_offset": 7,
"position": 19
},
{
"end_offset": 26,
"token": " dementia in alzhe",
"type": "word",
"start_offset": 7,
"position": 20
},
{
"end_offset": 27,
"token": " dementia in alzhei",
"type": "word",
"start_offset": 7,
"position": 21
},
{
"end_offset": 28,
"token": " dementia in alzheim",
"type": "word",
"start_offset": 7,
"position": 22
},
{
"end_offset": 29,
"token": " dementia in alzheime",
"type": "word",
"start_offset": 7,
"position": 23
},
{
"end_offset": 30,
"token": " dementia in alzheimer",
"type": "word",
"start_offset": 7,
"position": 24
},
{
"end_offset": 33,
"token": "s ",
"type": "word",
"start_offset": 31,
"position": 25
},
{
"end_offset": 34,
"token": "s d",
"type": "word",
"start_offset": 31,
"position": 26
},
{
"end_offset": 35,
"token": "s di",
"type": "word",
"start_offset": 31,
"position": 27
},
{
"end_offset": 36,
"token": "s dis",
"type": "word",
"start_offset": 31,
"position": 28
},
{
"end_offset": 37,
"token": "s dise",
"type": "word",
"start_offset": 31,
"position": 29
},
{
"end_offset": 38,
"token": "s disea",
"type": "word",
"start_offset": 31,
"position": 30
},
{
"end_offset": 39,
"token": "s diseas",
"type": "word",
"start_offset": 31,
"position": 31
},
{
"end_offset": 40,
"token": "s disease",
"type": "word",
"start_offset": 31,
"position": 32
},
{
"end_offset": 41,
"token": "s disease ",
"type": "word",
"start_offset": 31,
"position": 33
},
{
"end_offset": 42,
"token": "s disease w",
"type": "word",
"start_offset": 31,
"position": 34
},
{
"end_offset": 43,
"token": "s disease wi",
"type": "word",
"start_offset": 31,
"position": 35
},
{
"end_offset": 44,
"token": "s disease wit",
"type": "word",
"start_offset": 31,
"position": 36
},
{
"end_offset": 45,
"token": "s disease with",
"type": "word",
"start_offset": 31,
"position": 37
},
{
"end_offset": 46,
"token": "s disease with ",
"type": "word",
"start_offset": 31,
"position": 38
},
{
"end_offset": 47,
"token": "s disease with e",
"type": "word",
"start_offset": 31,
"position": 39
},
{
"end_offset": 48,
"token": "s disease with ea",
"type": "word",
"start_offset": 31,
"position": 40
},
{
"end_offset": 49,
"token": "s disease with ear",
"type": "word",
"start_offset": 31,
"position": 41
},
{
"end_offset": 50,
"token": "s disease with earl",
"type": "word",
"start_offset": 31,
"position": 42
},
{
"end_offset": 51,
"token": "s disease with early",
"type": "word",
"start_offset": 31,
"position": 43
},
{
"end_offset": 52,
"token": "s disease with early ",
"type": "word",
"start_offset": 31,
"position": 44
},
{
"end_offset": 53,
"token": "s disease with early o",
"type": "word",
"start_offset": 31,
"position": 45
},
{
"end_offset": 54,
"token": "s disease with early on",
"type": "word",
"start_offset": 31,
"position": 46
},
{
"end_offset": 55,
"token": "s disease with early ons",
"type": "word",
"start_offset": 31,
"position": 47
},
{
"end_offset": 56,
"token": "s disease with early onse",
"type": "word",
"start_offset": 31,
"position": 48
}
]
}
As you can see, the whole string tokenized with token size from 2 to 25 characters. The string is tokenized in a linear way together with all spaces and position incremented by one for every new token.
There are several problems with it:
edge_ngram_analyzer
produced unuseful tokens which will never be searched for, for example: "0 ", " ", " d", "s d", "s disease w" etc. "max_gram" : "25"
we “lost” some text in all fields. You can't search for this text anymore because there are no tokens for it.trim
filter only obfuscates the problem filtering extra spaces when it could be done by a tokenizer. edge_ngram_analyzer
increments the position of each token which is problematic for positional queries such as phrase queries. One should use the edge_ngram_filter
instead that will preserve the position of the token when generating the ngrams.The mappings settings to use:
...
"mappings": {
"Type": {
"_all":{
"analyzer": "edge_ngram_analyzer",
"search_analyzer": "keyword_analyzer"
},
"properties": {
"Field": {
"search_analyzer": "keyword_analyzer",
"type": "string",
"analyzer": "edge_ngram_analyzer"
},
...
...
"settings": {
"analysis": {
"filter": {
"english_poss_stemmer": {
"type": "stemmer",
"name": "possessive_english"
},
"edge_ngram": {
"type": "edgeNGram",
"min_gram": "2",
"max_gram": "25",
"token_chars": ["letter", "digit"]
}
},
"analyzer": {
"edge_ngram_analyzer": {
"filter": ["lowercase", "english_poss_stemmer", "edge_ngram"],
"tokenizer": "standard"
},
"keyword_analyzer": {
"filter": ["lowercase", "english_poss_stemmer"],
"tokenizer": "standard"
}
}
}
}
...
Look at the analysis:
{
"tokens": [
{
"end_offset": 5,
"token": "f0",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 5,
"token": "f00",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 5,
"token": "f00.",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 5,
"token": "f00.0",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 17,
"token": "de",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "dem",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "deme",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "demen",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "dement",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "dementi",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "dementia",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 20,
"token": "in",
"type": "word",
"start_offset": 18,
"position": 3
},
{
"end_offset": 32,
"token": "al",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alz",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzh",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzhe",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzhei",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzheim",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzheime",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzheimer",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 40,
"token": "di",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 40,
"token": "dis",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 40,
"token": "dise",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 40,
"token": "disea",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 40,
"token": "diseas",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 40,
"token": "disease",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 45,
"token": "wi",
"type": "word",
"start_offset": 41,
"position": 6
},
{
"end_offset": 45,
"token": "wit",
"type": "word",
"start_offset": 41,
"position": 6
},
{
"end_offset": 45,
"token": "with",
"type": "word",
"start_offset": 41,
"position": 6
},
{
"end_offset": 51,
"token": "ea",
"type": "word",
"start_offset": 46,
"position": 7
},
{
"end_offset": 51,
"token": "ear",
"type": "word",
"start_offset": 46,
"position": 7
},
{
"end_offset": 51,
"token": "earl",
"type": "word",
"start_offset": 46,
"position": 7
},
{
"end_offset": 51,
"token": "early",
"type": "word",
"start_offset": 46,
"position": 7
},
{
"end_offset": 57,
"token": "on",
"type": "word",
"start_offset": 52,
"position": 8
},
{
"end_offset": 57,
"token": "ons",
"type": "word",
"start_offset": 52,
"position": 8
},
{
"end_offset": 57,
"token": "onse",
"type": "word",
"start_offset": 52,
"position": 8
},
{
"end_offset": 57,
"token": "onset",
"type": "word",
"start_offset": 52,
"position": 8
}
]
}
On index time a text is tokenized by standard
tokenizer, then separate words are filtered by lowercase
, possessive_english
and edge_ngram
filters. Tokens are produced only for words.
On search time a text is tokenized by standard
tokenizer, then separate words are filtered by lowercase
and possessive_english
. The searched words are matched against the tokens which had been created during the index time.
Thus we make the incremental search possible!
Now, because we do ngram on separate words, we can even execute queries like
{
'query': {
'multi_match': {
'query': 'dem in alzh',
'type': 'phrase',
'fields': ['_all']
}
}
}
and get correct results.
No text is "lost", everything is searchable and there is no need to deal with spaces by trim
filter anymore.