Difference between WhitespaceTokenizerFactory and StandardTokenizerFactory

solr tokenize

trillions · Jun 25, 2012 · Viewed 11.8k times · Source

I am new to Solr. After reading Solr's wiki, I still don't understand the difference between WhitespaceTokenizerFactory and StandardTokenizerFactory. What is the real difference between them?

Answer

They differ in how they split the analyzed text into tokens.

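The StandardTokenizer does this based on the following rules (taken from the Lucene Javadoc of older releases; since Lucene 3.1 this grammar is provided by ClassicTokenizer, while StandardTokenizer follows the Unicode word-break rules of UAX #29):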

Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
Recognizes email addresses and internet hostnames as one token.

The WhitespaceTokenizer does this based on whitespace characters:

A WhitespaceTokenizer is a tokenizer that divides text at whitespace. Adjacent sequences of non-Whitespace characters form tokens.

You should pick the tokenizer that best fits your application. Whichever you choose, use the same (or at least compatible) analyzer/tokenizer chain for indexing and searching, otherwise query terms will not match the indexed tokens.
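As a rough sketch of how this choice looks in a Solr schema, the fragment below declares one field type per tokenizer. The field type names text_ws and text_std are only illustrative assumptions, and a real schema would usually add filters (lowercasing, stop words, and so on) after the tokenizer.

```xml
<!-- Illustrative schema.xml fragment; the names text_ws and text_std are assumptions. -->
<fieldType name="text_ws" class="solr.TextField">
  <!-- An analyzer without a type attribute is applied at both index and query time. -->
  <analyzer>
    <!-- Splits on whitespace only; punctuation stays attached to the tokens. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_std" class="solr.TextField">
  <analyzer>
    <!-- Splits on word boundaries and drops most punctuation. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
```

Because each field type here has a single analyzer element, indexing and searching go through the same tokenizer. To see the actual tokens each one produces for your own text, the analysis page of the Solr admin interface is probably the quickest check.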
