I have fitted a `CountVectorizer` to some documents in scikit-learn. I would like to see all the terms and their corresponding frequencies in the text corpus, in order to select stop-words. For example:

'and' 123 times, 'to' 100 times, 'for' 90 times, ... and so on

Is there any built-in function for this?
If `cv` is your `CountVectorizer` and `X` is the vectorized corpus, then

    zip(cv.get_feature_names(), np.asarray(X.sum(axis=0)).ravel())

gives a (term, frequency) pair for each distinct term in the corpus that the `CountVectorizer` extracted. (In Python 3, `zip` returns an iterator, so wrap it in `list()` if you need an actual list.)
(The little `asarray` + `ravel` dance is needed to work around a quirk in `scipy.sparse`: summing a sparse matrix along an axis returns a 2-d `numpy.matrix` of shape `(1, n_terms)`, and `asarray` + `ravel` flatten it into an ordinary 1-d array.)
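
For context, here is a minimal self-contained sketch of the same idea. The toy documents and the sort-by-frequency step are my additions for illustration; note also that scikit-learn 1.0+ renamed `get_feature_names()` to `get_feature_names_out()`, which this sketch uses:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration.
docs = [
    "the cat sat on the mat",
    "the dog ate the cat",
    "the dog sat on the log",
]

cv = CountVectorizer()
X = cv.fit_transform(docs)  # sparse document-term matrix

# Sum counts over all documents for each term; asarray + ravel flatten the
# (1, n_terms) matrix returned by the sparse sum into a 1-d array.
freqs = np.asarray(X.sum(axis=0)).ravel()

# Pair each term with its corpus-wide frequency and sort descending,
# so candidate stop-words appear first.
term_freqs = sorted(zip(cv.get_feature_names_out(), freqs),
                    key=lambda tf: tf[1], reverse=True)

for term, freq in term_freqs:
    print(term, freq)
# 'the' comes out on top with 6 occurrences, which is exactly the kind of
# high-frequency term you would consider adding to a stop-word list.
```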