As I am new to Elasticsearch, I am not able to identify the difference between the ngram token filter and the edge ngram token filter.
How do these two differ from each other in processing tokens?
I think the documentation for the edgeNGram tokenizer is pretty clear on this:
This tokenizer is very similar to nGram but only keeps n-grams which start at the beginning of a token.
And the best example for the nGram tokenizer again comes from the documentation:
curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_ngram_analyzer' -d 'FC Schalke 04'
# FC, Sc, Sch, ch, cha, ha, hal, al, alk, lk, lke, ke, 04
With this tokenizer definition:
"type" : "nGram",
"min_gram" : "2",
"max_gram" : "3",
"token_chars": [ "letter", "digit" ]
In short:

- The input text is first split into tokens. In this example: FC, Schalke, 04.
- nGram generates groups of characters of minimum min_gram size and maximum max_gram size from an input text. Basically, the tokens are split into small chunks and each chunk is anchored on a character (it doesn't matter where this character is, all of them will create chunks).
- edgeNGram does the same, but the chunks always start from the beginning of each token. Basically, the chunks are anchored at the beginning of the tokens.

For the same text as above, an edgeNGram generates this: FC, Sc, Sch, Scha, Schal, 04 (this particular output corresponds to a min_gram of 2 and a max_gram of 5). Every "word" in the text is considered, and for every "word" the first character is the starting point (F from FC, S from Schalke and 0 from 04).
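The edge variant can be imitated the same way. The sketch below is plain Python, not Elasticsearch code; `edge_ngrams` is a hypothetical helper, the regular expression only approximates splitting on letters and digits, and the defaults of 2 and 5 are assumed because they are the sizes that reproduce the edgeNGram output quoted above.

```python
import re

def edge_ngrams(text, min_gram=2, max_gram=5):
    grams = []
    # Split into runs of letters and digits (an approximation of token_chars).
    for word in re.findall(r"[^\W_]+", text):
        # Only one starting point per word: its first character.
        for size in range(min_gram, max_gram + 1):
            if size <= len(word):
                grams.append(word[:size])
    return grams

print(", ".join(edge_ngrams("FC Schalke 04")))
# FC, Sc, Sch, Scha, Schal, 04

print(", ".join(edge_ngrams("FC Schalke 04", max_gram=3)))
# FC, Sc, Sch, 04
```

Compared with a full n-gram, the only difference is that the starting position is fixed at index 0, which is why chunks such as ch or alk never appear.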