Okay, so I have been following these two posts on TF*IDF but am a little confused: http://css.dzone.com/articles/machine-learning-text-feature
Basically, I want to run a search query against multiple documents. I would like to use the scikit-learn toolkit as well as the NLTK library for Python.
The problem is that I don't see where the two TF*IDF vectors come from. I need one search query and multiple documents to search. I figured that I would calculate the TF*IDF scores of each document against each query, find the cosine similarity between them, and then rank the documents by sorting the scores in descending order. However, the code doesn't seem to come up with the right vectors.
Whenever I reduce the query to only one search, it returns a huge list of 0s, which is really strange.
Here is the code:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
train_set = ("The sky is blue.", "The sun is bright.") #Documents
test_set = ("The sun in the sky is bright.") #Query
stopWords = stopwords.words('english')
vectorizer = CountVectorizer(stop_words = stopWords)
transformer = TfidfTransformer()
trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print 'Fit Vectorizer to train set', trainVectorizerArray
print 'Transform Vectorizer to test set', testVectorizerArray
transformer.fit(trainVectorizerArray)
print transformer.transform(trainVectorizerArray).toarray()
transformer.fit(testVectorizerArray)
tfidf = transformer.transform(testVectorizerArray)
print tfidf.todense()
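You're defining train_set and test_set with parentheses, but I think that they should be lists. In particular, test_set is not actually a tuple: without a trailing comma, the parentheses leave it as a plain string rather than a sequence of documents. Using lists avoids this: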
train_set = ["The sky is blue.", "The sun is bright."] #Documents
test_set = ["The sun in the sky is bright."] #Query
With this change, the code seems to run fine.
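For the ranking step described in the question (score each document against the query, then sort by cosine similarity in descending order), here is a minimal sketch. It rests on a few assumptions that are not in the original code: it uses scikit-learn's `TfidfVectorizer` in place of the separate `CountVectorizer` and `TfidfTransformer`, it uses scikit-learn's built-in English stop word list instead of NLTK's, and the names `documents`, `query`, `scores`, and `ranking` are only illustrative. The IDF weights are fitted on the documents alone, and the query is only transformed, so both end up in the same vector space.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Both inputs are lists of strings, as suggested above.
documents = ["The sky is blue.", "The sun is bright."]
query = ["The sun in the sky is bright."]

# Assumption: scikit-learn's built-in English stop words stand in for NLTK's list.
vectorizer = TfidfVectorizer(stop_words="english")

# Learn the vocabulary and IDF weights from the documents only.
doc_tfidf = vectorizer.fit_transform(documents)

# Reuse that vocabulary and those weights for the query (no refitting).
query_tfidf = vectorizer.transform(query)

# One similarity score per document, flattened to a 1-D array.
scores = cosine_similarity(query_tfidf, doc_tfidf).ravel()

# Document indices ordered from most to least similar.
ranking = scores.argsort()[::-1]

for idx in ranking:
    print(f"{scores[idx]:.3f}  {documents[idx]}")
```

With these inputs the second document should rank first, because it shares two non-stop-word terms with the query ("sun" and "bright") while the first shares only one ("sky").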