How does the edge ngram token filter differ from the ngram token filter?

Question 1

elasticsearch token analyzer

Karunakar · Jul 14, 2015 · Viewed 10.9k times · Source

Answer

I think the documentation is pretty clear on this:

"This tokenizer is very similar to nGram but only keeps n-grams which start at the beginning of a token."

The best example of the nGram tokenizer also comes from the documentation:

    curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_ngram_analyzer' -d 'FC Schalke 04'
    # FC, Sc, Sch, ch, cha, ha, hal, al, alk, lk, lke, ke, 04

With this tokenizer definition:

    "type" : "nGram",
    "min_gram" : "2",
    "max_gram" : "3",
    "token_chars": [ "letter", "digit" ]

In short:

The tokenizer first splits the text into tokens according to its configuration. In this example the tokens are FC, Schalke, and 04.
nGram generates character sequences of at least min_gram and at most max_gram characters from each token. Each token is cut into small chunks, and a chunk can start at any character position in the token.
edgeNGram does the same, but every chunk starts at the beginning of its token. The chunks are anchored to the token's first character.
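The nGram behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Elasticsearch's implementation: the function names `split_tokens` and `ngrams` are hypothetical, and the regular expression is an assumed approximation of `token_chars: [ "letter", "digit" ]`.

```python
import re


def split_tokens(text):
    # Assumed approximation of token_chars = ["letter", "digit"]:
    # keep runs of letters/digits, treat everything else as a separator.
    return re.findall(r"[^\W_]+", text)


def ngrams(text, min_gram=2, max_gram=3):
    # For every token, start a chunk at every character position and
    # emit each length from min_gram to max_gram that still fits.
    grams = []
    for token in split_tokens(text):
        for start in range(len(token)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(token):
                    grams.append(token[start:start + size])
    return grams


print(split_tokens("FC Schalke 04"))
print(ngrams("FC Schalke 04"))
```

With min_gram 2 and max_gram 3 this yields the same chunks as the documentation example: FC, Sc, Sch, ch, cha, ha, hal, al, alk, lk, lke, ke, 04.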

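For the same text as above, an edgeNGram with min_gram 2 and max_gram 5 (the settings of the documentation's edgeNGram example) generates this: FC, Sc, Sch, Scha, Schal, 04. With max_gram 3, as in the nGram definition above, the output would stop at FC, Sc, Sch, 04. Every "word" in the text is considered, and for every "word" the first character is the starting point (F from FC, S from Schalke and 0 from 04).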
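A rough Python sketch of the edge variant, under the same caveats: `split_tokens` and `edge_ngrams` are hypothetical helper names, the regular expression only approximates `token_chars: [ "letter", "digit" ]`, and none of this is Elasticsearch's actual code.

```python
import re


def split_tokens(text):
    # Assumed approximation of token_chars = ["letter", "digit"].
    return re.findall(r"[^\W_]+", text)


def edge_ngrams(text, min_gram=2, max_gram=5):
    # Only prefixes are emitted: every chunk starts at position 0 of its
    # token and grows from min_gram up to max_gram (or the token length).
    grams = []
    for token in split_tokens(text):
        for size in range(min_gram, min(max_gram, len(token)) + 1):
            grams.append(token[:size])
    return grams


print(edge_ngrams("FC Schalke 04", min_gram=2, max_gram=5))
print(edge_ngrams("FC Schalke 04", min_gram=2, max_gram=3))
```

The first call gives FC, Sc, Sch, Scha, Schal, 04; the second, with max_gram lowered to 3, gives FC, Sc, Sch, 04. In both cases no chunk starts in the middle of a token, which is the whole difference from nGram.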

Question 2

As I am new to Elasticsearch, I am not able to identify the difference between the ngram token filter and the edge ngram token filter.

How do these two differ from each other in processing tokens?
