Top "Tokenize" questions

Tokenizing is the act of splitting a string into discrete elements called tokens.
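
As a rough illustration of that definition, here is a minimal sketch in Python. The function name tokenize and the rule it uses (runs of word characters become one token, each punctuation character becomes its own token, whitespace is discarded) are assumptions chosen for this example; real tokenizers such as those in NLTK, Lucene, or Solr apply their own, more elaborate rules.

```python
import re

def tokenize(text):
    """Split text into word tokens and single-character punctuation tokens.

    Hypothetical helper for illustration only: \\w+ matches a run of
    letters, digits, or underscores, and [^\\w\\s] matches one character
    that is neither a word character nor whitespace.
    """
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Hello, world! 3+4"))
# ['Hello', ',', 'world', '!', '3', '+', '4']
```

Whitespace only separates tokens here and never appears in the output, which is one of several possible design choices.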

Tokenization of Arabic words using NLTK

I'm using NLTK's word_tokenize to split a sentence into words. I want to tokenize this sentence: في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي ..لدرجة اني اسوي نفسي ادور شيء The code I'm …

python tokenize nltk
Difference between WhitespaceTokenizerFactory and StandardTokenizerFactory

I am new to Solr. By reading Solr's wiki, I don't understand the differences between WhitespaceTokenizerFactory and StandardTokenizerFactory. What's their …

solr tokenize
Tokenize, remove stop words using Lucene with Java

I am trying to tokenize and remove stop words from a txt file with Lucene. I have this: public String …

java lucene nlp tokenize stop-words
Tokenizing a string twice in C with strtok()

I'm using strtok() in C to parse a CSV string. First I tokenize it to just find out how many …

c csv tokenize strtok
How does a parser (for example, HTML) work?

For argument's sake, let's assume an HTML parser. I've read that it tokenizes everything first, and then parses it. What …

html browser parsing html-parsing tokenize
How to build a parse tree of a mathematical expression?

I'm learning how to write tokenizers and parsers, and as an exercise I'm writing a calculator in JavaScript. I'm using a …

parsing tokenize evaluation
How to parse a log file in PowerShell and write out the desired output

I have a script which uses robocopy to transfer files and write logs to a file, "Logfile.txt". After that, …

powershell powershell-2.0 tokenize robocopy logparser
Solr: exact phrase query with an EdgeNGramFilterFactory

In Solr (3.3), is it possible to make a field letter-by-letter searchable through an EdgeNGramFilterFactory and also sensitive to phrase queries? …

solr tokenize phrase
Java Lucene NGramTokenizer

I am trying to tokenize strings into n-grams. Strangely, in the documentation for the NGramTokenizer I do not see a method …

java lucene tokenize n-gram