I was reading about the TfidfVectorizer implementation in scikit-learn, and I don't understand the output of its methods. For example:
new_docs = ['He watches basketball and baseball', 'Julie likes to play basketball', 'Jane loves to play baseball']
new_term_freq_matrix = tfidf_vectorizer.transform(new_docs)
print tfidf_vectorizer.vocabulary_
print new_term_freq_matrix.todense()
output:
{u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2}
[[ 0.57735027 0.57735027 0.57735027 0. 0. 0. 0.
0. 0. 0. 0. ]
[ 0. 0.68091856 0. 0. 0.51785612 0.51785612
0. 0. 0. 0. 0. ]
[ 0.62276601 0. 0. 0.62276601 0. 0. 0.
0.4736296 0. 0. 0. ]]
What is this mapping (e.g. u'me': 8)?
{u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2}
Is this a matrix or just a vector? I can't understand what the output is telling me:
[[ 0.57735027 0.57735027 0.57735027 0. 0. 0. 0.
0. 0. 0. 0. ]
[ 0. 0.68091856 0. 0. 0.51785612 0.51785612
0. 0. 0. 0. 0. ]
[ 0.62276601 0. 0. 0.62276601 0. 0. 0.
0.4736296 0. 0. 0. ]]
Could anybody explain these outputs in more detail?
Thanks!
TfidfVectorizer transforms text into feature vectors that can be used as input to an estimator.
vocabulary_
Is a dictionary that maps each token (word) to a feature index in the matrix; each unique token gets its own feature index.
What is?(e.g.: u'me': 8 )
It tells you that the token 'me' is represented as feature number 8 in the output matrix, i.e. the column with zero-based index 8.
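To make the token-to-column mapping concrete, here is a minimal sketch. The question does not show what the vectorizer was fitted on, so train_docs below is an assumption: three sentences that, with default TfidfVectorizer settings, should produce the same 11-token vocabulary. The name index_to_token is only a helper introduced for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumed training corpus: the question does not show the real one. These
# sentences are chosen because they yield the same 11-token vocabulary.
train_docs = [
    "Julie loves me more than Linda loves me",
    "Jane likes me more than Julie loves me",
    "He likes basketball more than baseball",
]

tfidf_vectorizer = TfidfVectorizer()  # default settings assumed
tfidf_vectorizer.fit(train_docs)

# vocabulary_ maps token -> column index; invert it to go from column to token.
index_to_token = {idx: tok for tok, idx in tfidf_vectorizer.vocabulary_.items()}

print(tfidf_vectorizer.vocabulary_["me"])  # column index assigned to 'me'
print(index_to_token[8])                   # token stored in column 8

new_docs = [
    "He watches basketball and baseball",
    "Julie likes to play basketball",
    "Jane loves to play baseball",
]
matrix = tfidf_vectorizer.transform(new_docs)  # sparse matrix: one row per document

# Label every non-zero weight in each row with the token it belongs to.
for row in matrix.toarray():
    print({index_to_token[i]: round(float(w), 4) for i, w in enumerate(row) if w > 0})
```

Words such as 'watches', 'to' and 'play' were never seen during fitting, so they have no column and are simply ignored by transform.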
is this a matrix or just a vector?
Each sentence is a vector, and together the three sentences you entered form a matrix with 3 rows (one per sentence) and 11 columns (one per vocabulary token). In each vector, the numbers (weights) are the tf-idf scores of the features. For example, 'julie': 4 tells you that every sentence in which 'Julie' appears will have a non-zero tf-idf weight at index 4. As you can see in the 2nd vector:
[ 0. 0.68091856 0. 0. 0.51785612 0.51785612 0. 0. 0. 0. 0. ]
The 5th element (index 4) scored 0.51785612, which is the tf-idf score for 'Julie'. For more info about tf-idf scoring, read here: http://en.wikipedia.org/wiki/Tf%E2%80%93idf
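As a rough check of where a number like 0.51785612 can come from, the sketch below recomputes the second row by hand. It assumes scikit-learn's default settings (smoothed idf, L2 normalization) and an assumed training corpus of 3 documents in which 'basketball' occurs in 1 document and 'julie' and 'likes' in 2 each. Those counts are not given in the question. The tokens 'to' and 'play' are left out because they are not in the vocabulary.

```python
import math

n_docs = 3  # assumed number of documents the vectorizer was fitted on

# Assumed document frequencies (how many training documents contain each token).
doc_freq = {"basketball": 1, "julie": 2, "likes": 2}

def smoothed_idf(df, n):
    # idf used when smooth_idf=True: ln((1 + n) / (1 + df)) + 1
    return math.log((1.0 + n) / (1.0 + df)) + 1.0

# Each token occurs once in 'Julie likes to play basketball', so tf = 1.
raw = {tok: 1 * smoothed_idf(df, n_docs) for tok, df in doc_freq.items()}

# L2-normalize the row so that its Euclidean length is 1.
norm = math.sqrt(sum(w * w for w in raw.values()))
weights = {tok: w / norm for tok, w in raw.items()}

for tok in sorted(weights):
    print(tok, round(weights[tok], 8))
```

Under these assumptions, 'basketball' gets the largest weight in the row because it is the rarest of the three tokens in the training corpus.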