How to use the scikit-learn CountVectorizer?

Sanjeev · Dec 12, 2016 · Viewed 8.9k times

I have a set of words for which I have to check whether they are present in the documents.

WordList = [w1, w2, ..., wn]

I also have a list of documents, and I need to check whether these words are present in each of them.

How can I use scikit-learn's CountVectorizer so that the features of the term-document matrix are only the words from WordList, each row represents one document, and each column holds the number of times the corresponding word from the list appears in that document?

Answer

Sanjeev · Dec 12, 2016

OK, I figured it out. The code is given below:

from sklearn.feature_extraction.text import CountVectorizer

# Count the number of times each word (unigram) appears in each document.
vectorizer = CountVectorizer(input='content', binary=False, ngram_range=(1, 1))
# First set the vocabulary by fitting on WordList (a list of strings).
# Note: fit() tokenizes each entry, so with the default settings words are
# lowercased and single-character tokens are dropped.
vectorizer = vectorizer.fit(WordList)
# Now transform the documents; Document_List is a list of strings, one per document.
tfMatrix = vectorizer.transform(Document_List).toarray()

This outputs a matrix with one row per document and one column per word learned from WordList, so the features come from WordList only.
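
For what it's worth, a more direct route may be to pass the word list through the `vocabulary` parameter of CountVectorizer, which fixes the columns to those words in the given order without a separate fit step. Below is a minimal sketch under that assumption; `word_list` and `document_list` are made-up sample data, not taken from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample data for illustration only.
word_list = ["apple", "banana", "cherry"]
document_list = [
    "apple banana apple",
    "cherry pie with banana",
    "no fruit here",
]

# vocabulary= fixes the feature columns to word_list, in list order.
vectorizer = CountVectorizer(vocabulary=word_list)

# With a vocabulary supplied to the constructor, transform() can be
# called directly; no prior fit() is needed.
tf_matrix = vectorizer.transform(document_list).toarray()

# Print each document's counts next to the column names.
print(word_list)
for row in tf_matrix:
    print(row.tolist())
# ['apple', 'banana', 'cherry']
# [2, 1, 0]
# [0, 1, 1]
# [0, 0, 0]
```

With the default settings the documents are lowercased before matching, so the entries in the vocabulary should be lowercase too, and single-character words will not be matched unless `token_pattern` is changed.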