Top "Tokenize" questions

Tokenizing is the act of splitting a string into discrete elements called tokens.
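For example, the simplest form of tokenization is splitting on whitespace (a trivial Python illustration with a made-up sample string):

```python
text = "Tokenizing is the act of splitting a string into tokens"
tokens = text.split()  # split on runs of whitespace
print(tokens)  # ['Tokenizing', 'is', 'the', 'act', ...]
```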

Remove stopwords and tokenize for BigramCollocationFinder (NLTK)

I keep getting the error "TypeError: expected string or buffer", raised from return _compile(pattern, flags).sub(repl, string, count), when …

python nltk tokenize stop-words
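A minimal sketch of the usual NLTK pattern for this, assuming the punkt and stopwords data have been downloaded; the sample text is a placeholder. Note that word_tokenize expects a single string, and passing it a list instead is a common cause of the "expected string or buffer" TypeError quoted above.

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

# One-time data downloads, if needed:
# import nltk; nltk.download("punkt"); nltk.download("stopwords")

text = "the quick brown fox jumps over the lazy dog and the quick brown cat"

# word_tokenize takes a single string, not a list of strings.
tokens = word_tokenize(text.lower())

stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t.isalpha() and t not in stop_words]

# Find the top bigram collocations among the remaining tokens.
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(filtered)
print(finder.nbest(bigram_measures.pmi, 5))
```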
Sentence Segmentation using Spacy

I am new to spaCy and NLP. I am facing the issue below while doing sentence segmentation with spaCy. The text I …

nlp tokenize spacy sentence
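A minimal sketch of sentence segmentation with spaCy, assuming the en_core_web_sm model is installed (python -m spacy download en_core_web_sm); the sample text is a placeholder.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # the parser supplies sentence boundaries
doc = nlp("This is the first sentence. Here is the second one! Is this the third?")

for sent in doc.sents:
    print(sent.text)
```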
Search for names (text) with spaces in Elasticsearch

Searching for names (text) with spaces in them is causing me problems. I have a mapping similar to "{"user":{"properties":{"name":{"…

search elasticsearch tokenize analyzer
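A common fix is to index the name untokenized so that values with spaces match exactly. A sketch, assuming Elasticsearch 7+ on a local cluster; the index name, field layout, and URL are placeholders, since the actual mapping in the question is truncated.

```python
import requests

index_url = "http://localhost:9200/users"  # hypothetical index

# A "keyword" sub-field stores the name as a single untokenized term,
# so "John Smith" can be matched exactly instead of being split on spaces.
mapping = {
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "fields": {"raw": {"type": "keyword"}},
            }
        }
    }
}
print(requests.put(index_url, json=mapping).json())

# Exact match against the untokenized sub-field.
query = {"query": {"term": {"name.raw": "John Smith"}}}
print(requests.post(f"{index_url}/_search", json=query).json())
```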
What are all the Japanese whitespace characters?

I need to split a string and extract words separated by whitespace characters. The source may be in English or …

text unicode whitespace tokenize cjk
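The Japanese-specific whitespace character encountered most often is the ideographic (full-width) space, U+3000. A quick illustration in Python 3, where \s already matches it because str patterns use Unicode matching by default; the sample text is made up.

```python
import re

# "\u3000" is the ideographic (full-width) space used in Japanese text.
text = "Hello world\u3000これは\u3000テスト です"

# \s covers U+3000 as well as ASCII whitespace under Unicode matching.
print(re.split(r"\s+", text))
# ['Hello', 'world', 'これは', 'テスト', 'です']
```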
Does PL/SQL have an equivalent to Java's StringTokenizer?

I use java.util.StringTokenizer for simple parsing of delimited strings in Java. I need the same …

sql oracle plsql tokenize stringtokenizer
Tokenizing using Pandas and spaCy

I'm working on my first Python project and have a reasonably large dataset (tens of thousands of rows). I need to …

python python-3.x pandas tokenize spacy
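A minimal sketch of one common pattern, assuming a DataFrame with a hypothetical "text" column and the en_core_web_sm model installed; nlp.pipe streams the column instead of calling nlp() row by row, which matters at tens of thousands of rows.

```python
import pandas as pd
import spacy

# Disable pipeline components that plain tokenization doesn't need.
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

df = pd.DataFrame({"text": ["First example sentence.", "Another row of raw text here."]})

# nlp.pipe processes the texts as a stream and yields one Doc per row.
df["tokens"] = [[token.text for token in doc] for doc in nlp.pipe(df["text"])]
print(df)
```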
Tokenizing a String with tab delimiter in Java while skipping some tokens

I have a huge data file (~8 GB, ~80 million records). Every record has 6 to 8 attributes which are split by a …

java tokenize stringtokenizer
Using multiple tokenizers in Solr

I want to perform a query and get back results that are not case …

solr tokenize
Elasticsearch "pattern_replace", replacing whitespaces while analyzing

Basically, I want to remove all whitespace and tokenize the whole string as a single token. (I will use nGram …

elasticsearch whitespace tokenize removing-whitespace
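One way this is commonly configured: a pattern_replace character filter strips the whitespace before a keyword tokenizer emits the whole value as a single token (an nGram token filter could then be appended). A sketch, assuming Elasticsearch 7+ on a local cluster; the index and analyzer names are placeholders.

```python
import requests

index_url = "http://localhost:9200/my_index"  # hypothetical index

settings = {
    "settings": {
        "analysis": {
            "char_filter": {
                "strip_whitespace": {
                    "type": "pattern_replace",
                    "pattern": "\\s+",
                    "replacement": "",
                }
            },
            "analyzer": {
                "single_token_no_spaces": {
                    "type": "custom",
                    "char_filter": ["strip_whitespace"],
                    "tokenizer": "keyword",
                }
            },
        }
    }
}
print(requests.put(index_url, json=settings).json())

# Inspect what the analyzer produces for a sample value.
test = {"analyzer": "single_token_no_spaces", "text": "foo bar baz"}
print(requests.post(f"{index_url}/_analyze", json=test).json())
```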
Boost::Split using whole string as delimiter

I would like to know whether there is a way, using boost::split, to split a string using whole strings …

c++ string boost tokenize