I am using a combination of NLTK and scikit-learn's CountVectorizer for stemming words and tokenization.

Below is an example of the plain usage of the CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer
vocab = ['The swimmer likes swimming so he swims.']
vec = CountVectorizer().fit(vocab)
sentence1 = vec.transform(['The swimmer likes swimming.'])
sentence2 = vec.transform(['The swimmer swims.'])
print('Vocabulary: %s' % vec.get_feature_names())
print('Sentence 1: %s' % sentence1.toarray())
print('Sentence 2: %s' % sentence2.toarray())
Which will print
Vocabulary: ['he', 'likes', 'so', 'swimmer', 'swimming', 'swims', 'the']
Sentence 1: [[0 1 0 1 1 0 1]]
Sentence 2: [[0 0 0 1 0 1 1]]
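(Note: on scikit-learn 1.0 and later, get_feature_names() has been replaced by get_feature_names_out(); the snippets here use the older API.)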
Now, let's say I want to remove stop words and stem the words. One option would be to do it like so:
import nltk
from nltk.stem.porter import PorterStemmer

#######
# based on http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html
stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems
########
vect = CountVectorizer(tokenizer=tokenize, stop_words='english')
vect.fit(vocab)
sentence1 = vect.transform(['The swimmer likes swimming.'])
sentence2 = vect.transform(['The swimmer swims.'])
print('Vocabulary: %s' % vect.get_feature_names())
print('Sentence 1: %s' % sentence1.toarray())
print('Sentence 2: %s' % sentence2.toarray())
Which prints:
Vocabulary: ['.', 'like', 'swim', 'swimmer']
Sentence 1: [[1 1 1 1]]
Sentence 2: [[1 0 1 1]]
But how would I best get rid of the punctuation characters in this second version?
There are several options. One is to remove the punctuation before tokenization, but note that this turns don't into dont:
import string

def tokenize(text):
    text = "".join([ch for ch in text if ch not in string.punctuation])
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems
Another option is to remove the punctuation after tokenization:
def tokenize(text):
    tokens = nltk.word_tokenize(text)
    tokens = [i for i in tokens if i not in string.punctuation]
    stems = stem_tokens(tokens, stemmer)
    return stems
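The two variants also treat contractions differently, because NLTK's word_tokenize splits don't into do and n't. A quick comparison (my own illustration, not part of the original post):

import string
from nltk import word_tokenize

text = "Don't swim."

# Variant 1: strip punctuation first, then tokenize. The apostrophe is
# already gone, so the contraction survives as a single token.
no_punct = "".join(ch for ch in text if ch not in string.punctuation)
print(word_tokenize(no_punct))  # ['Dont', 'swim']

# Variant 2: tokenize first, then drop punctuation-only tokens. NLTK has
# already split the contraction, and "n't" is not a punctuation character.
tokens = [t for t in word_tokenize(text) if t not in string.punctuation]
print(tokens)  # ['Do', "n't", 'swim']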
The above code will work, but it is rather slow because it loops through the same text multiple times. If you have more steps, such as removing digits, removing stopwords, or lowercasing, it is better to lump the steps together as much as possible so that the text is only traversed once; the more pre-processing your data requires, the bigger the payoff.
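Here is a minimal single-pass sketch of that idea (my own illustration, not from the original post; the name tokenize_all is made up, and it assumes NLTK's punkt and stopwords data have been downloaded):

import string
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()
stop = set(stopwords.words('english'))
drop_chars = set(string.punctuation) | set(string.digits)

def tokenize_all(text):
    # lowercase and tokenize once, then filter and stem in a single pass
    tokens = nltk.word_tokenize(text.lower())
    return [stemmer.stem(t) for t in tokens
            if t not in stop and not all(ch in drop_chars for ch in t)]

vect = CountVectorizer(tokenizer=tokenize_all)
vect.fit(vocab)
print('Vocabulary: %s' % vect.get_feature_names())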