I have been working with the CountVectorizer class in scikit-learn.
I understand that if used in the manner shown below, the final output will consist of an array containing counts of features, or tokens.
These tokens are extracted from a set of keywords, i.e.
tags = [
    "python, tools",
    "linux, tools, ubuntu",
    "distributed systems, linux, networking, tools",
]
The next step is:
from sklearn.feature_extraction.text import CountVectorizer
# tokenize is not shown in the original; assume a comma splitter like this
def tokenize(text):
    return [t.strip() for t in text.split(",")]
vec = CountVectorizer(tokenizer=tokenize)
data = vec.fit_transform(tags).toarray()
print(data)
which gives:
[[0 0 0 1 1 0]
 [0 1 0 0 1 1]
 [1 1 1 0 1 0]]
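(For reference, the columns follow the alphabetically sorted vocabulary; you can check the order with get_feature_names_out, assuming scikit-learn 1.0 or later — older versions call it get_feature_names:)

print(vec.get_feature_names_out())
# ['distributed systems' 'linux' 'networking' 'python' 'tools' 'ubuntu']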
This is fine, but my situation is just a little bit different.
I want to extract the features the same way as above, but I don't want the rows in data to be the same documents that the features were extracted from.
In other words, how can I get counts of another set of documents, say,
list_of_new_documents = [
    "python, chicken",
    "linux, cow, ubuntu",
    "machine learning, bird, fish, pig",
]
And get:
[[0 0 0 1 0 0]
 [0 1 0 0 0 1]
 [0 0 0 0 0 0]]
I have read the documentation for the CountVectorizer class and came across the vocabulary argument, which maps terms to feature indices. I can't seem to get this argument to help me, however.
Any advice is appreciated.
PS: all credit due to Matthias Friedrich's Blog for the example I used above.
You're right that vocabulary is what you want. It works like this:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> cv = CountVectorizer(vocabulary=['hot', 'cold', 'old'])
>>> cv.fit_transform(['pease porridge hot', 'pease porridge cold',
...                   'pease porridge in the pot', 'nine days old']).toarray()
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 0],
       [0, 0, 1]], dtype=int64)
So you can pass it either a dict with your desired features as the keys (and column indices as the values) or, as above, just a list of features, in which case indices are assigned in list order.
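If you want explicit control over which column each feature ends up in, the dict form spells it out; a minimal sketch, equivalent to the list above:

>>> cv = CountVectorizer(vocabulary={'hot': 0, 'cold': 1, 'old': 2})
>>> cv.fit_transform(['pease porridge hot']).toarray()
array([[1, 0, 0]], dtype=int64)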
If you used CountVectorizer on one set of documents and then want to use the features from those documents for a new set, pass the vocabulary_ attribute of your original CountVectorizer to the new one. So in your example, you could do

newVec = CountVectorizer(tokenizer=tokenize, vocabulary=vec.vocabulary_)

to create a new vectorizer that reuses the vocabulary from your first one (keeping the same tokenize function, since the new documents are also comma-separated tags).
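Putting it together with the data from your question, here is a minimal end-to-end sketch (tokenize is the same comma splitter assumed above):

from sklearn.feature_extraction.text import CountVectorizer

def tokenize(text):
    # split comma-separated tags, trimming whitespace
    return [t.strip() for t in text.split(",")]

tags = [
    "python, tools",
    "linux, tools, ubuntu",
    "distributed systems, linux, networking, tools",
]
list_of_new_documents = [
    "python, chicken",
    "linux, cow, ubuntu",
    "machine learning, bird, fish, pig",
]

# learn the feature set from the original tags
vec = CountVectorizer(tokenizer=tokenize)
vec.fit(tags)

# reuse that learned vocabulary for the new documents
newVec = CountVectorizer(tokenizer=tokenize, vocabulary=vec.vocabulary_)
print(newVec.fit_transform(list_of_new_documents).toarray())
# [[0 0 0 1 0 0]
#  [0 1 0 0 0 1]
#  [0 0 0 0 0 0]]

That said, since vec is already fitted, you can get the same result with just vec.transform(list_of_new_documents), which applies the learned vocabulary to new data without building a second vectorizer.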