Keep TFIDF result for predicting new content using Scikit for Python

Question 1

Keep TFIDF result for predicting new content using Scikit for Python

python machine-learning scikit-learn tf-idf

lol.Wen · Apr 22, 2015 · Viewed 27.9k times · Source

Answer

Answer

I successfully saved the feature list by saving vectorizer.vocabulary_, and reuse by CountVectorizer(decode_error="replace",vocabulary=vectorizer.vocabulary_)

Codes below:

corpus = np.array(["aaa bbb ccc", "aaa bbb ddd"])
vectorizer = CountVectorizer(decode_error="replace")
vec_train = vectorizer.fit_transform(corpus)
#Save vectorizer.vocabulary_
pickle.dump(vectorizer.vocabulary_,open("feature.pkl","wb"))

#Load it later
transformer = TfidfTransformer()
loaded_vec = CountVectorizer(decode_error="replace",vocabulary=pickle.load(open("feature.pkl", "rb")))
tfidf = transformer.fit_transform(loaded_vec.fit_transform(np.array(["aaa ccc eee"])))

That works. tfidf will have same feature length as trained data.

Question 2

I am using sklearn on Python to do some clustering. I've trained 200,000 data, and code below works well.

corpus = open("token_from_xml.txt")
vectorizer = CountVectorizer(decode_error="replace")
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
km = KMeans(30)
kmresult = km.fit(tfidf).predict(tfidf)

But when I have new testing content, I'd like to cluster it to existed clusters I'd trained. So I'm wondering how to save IDF result, so that I can do TFIDF for the new testing content and make sure the result for new testing content have same array length.

Thanks in advance.

UPDATE

I may need to save "transformer" or "tfidf" variable to file(txt or others), if one of them contains the trained IDF result.

UPDATE

For example. I have the training data:

["a", "b", "c"]
["a", "b", "d"]

And do TFIDF, the result will contains 4 features(a,b,c,d)

When I TEST:

["a", "c", "d"]

to see which cluster(already made by k-means) it belongs to. TFIDF will only give the result with 3 features(a,c,d), so the clustering in k-means will fall. (If I test ["a", "b", "e"], there may have other problems.)

So how to store the features list for testing data (even more, store it in file)?

UPDATE

Solved, see answers below.

Keep TFIDF result for predicting new content using Scikit for Python

Answer

Related questions