Find the tf-idf score of specific words in documents using sklearn

WhiteTiger · Jun 22, 2015 · Viewed 9.8k times

I have code that runs a basic TF-IDF vectorizer on a collection of documents, returning a sparse matrix of shape D x F, where D is the number of documents and F is the number of terms. No problem.

But how do I find the TF-IDF score of a specific term in a given document? That is, is there some sort of dictionary mapping terms (in their textual representation) to their position in the resulting sparse matrix?

Answer

Ryan · Jun 22, 2015

Yes. See the .vocabulary_ attribute on your fitted TF-IDF vectorizer.

In [1]: from sklearn.datasets import fetch_20newsgroups

In [2]: data = fetch_20newsgroups(categories=['rec.autos'])

In [3]: from sklearn.feature_extraction.text import TfidfVectorizer

In [4]: cv = TfidfVectorizer()

In [5]: X = cv.fit_transform(data.data)

In [6]: cv.vocabulary_

It is a dictionary of the form:

{word : column index in array}
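
To get from that dictionary to an actual score, you can use the column index to index into the sparse matrix alongside the document's row index. The following is a minimal sketch, not part of the original answer. It uses a tiny made-up corpus instead of the 20 newsgroups data so it runs without a download, and the helper name `tfidf_score` is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative corpus (made-up data, stands in for data.data above)
docs = [
    "the car has a fast engine",
    "the engine needs new oil",
    "bikes have no engine",
]

cv = TfidfVectorizer()
X = cv.fit_transform(docs)  # sparse matrix, one row per document, one column per term


def tfidf_score(vectorizer, matrix, term, doc_index):
    """Hypothetical helper: TF-IDF score of `term` in row `doc_index`.

    Returns 0.0 if the term is not in the fitted vocabulary.
    """
    col = vectorizer.vocabulary_.get(term)  # column index, or None if unseen
    if col is None:
        return 0.0
    return float(matrix[doc_index, col])


print(tfidf_score(cv, X, "car", 0))    # "car" occurs in document 0, so nonzero
print(tfidf_score(cv, X, "car", 1))    # "car" does not occur in document 1
print(tfidf_score(cv, X, "zebra", 0))  # term outside the vocabulary
```

Note that the keys of .vocabulary_ are the tokens as the vectorizer produced them (lowercased by default), so look up "car" rather than "Car" unless you changed the preprocessing options.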