I have a list of words, and I need to check whether they appear in a set of documents.
WordList = [w1, w2, ..., wn]
I also have a list of documents in which I need to look for these words.
How can I use scikit-learn's CountVectorizer
so that the features of the term-document matrix are only the words from WordList,
and each row represents one document, with each column holding the number of times the corresponding word from the list appears in that document?
OK, I figured it out. The code is given below:
from sklearn.feature_extraction.text import CountVectorizer
# Count the number of times each word (unigram) appears in a document.
vectorizer = CountVectorizer(input='content', binary=False, ngram_range=(1, 1))
# First, set the vocabulary by fitting on the word list.
vectorizer = vectorizer.fit(WordList)
# Now transform the documents. Document_List is a list of strings, one per document.
tfMatrix = vectorizer.transform(Document_List).toarray()
This outputs a term-document matrix whose features are only the words from WordList.
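As a possible alternative to fitting on the word list, CountVectorizer also accepts a `vocabulary` argument that fixes the feature columns directly. The sketch below assumes made-up sample data (`word_list` and `document_list` are hypothetical names and values, not from the original post) and shows the resulting counts:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample data, for illustration only.
word_list = ["apple", "banana", "cherry"]
document_list = [
    "Apple pie with apple sauce",
    "banana bread and cherry jam",
    "no fruit here",
]

# Passing vocabulary= fixes the feature columns to word_list, in the given
# order, so no separate fit step is needed before transform.
vectorizer = CountVectorizer(vocabulary=word_list, ngram_range=(1, 1))

# Each row is one document; each column is the count of one word from word_list.
tf_matrix = vectorizer.transform(document_list).toarray()

print(tf_matrix)
# [[2 0 0]
#  [0 1 1]
#  [0 0 0]]
```

Two things to keep in mind with either approach: documents are lowercased by default, so the words in the list should be lowercase to match, and the default token pattern ignores single-character tokens.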