from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn import linear_model
arr=['dogs cats lions','apple pineapple orange','water fire earth air', 'sodium potassium calcium']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(arr)
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
Y = ['animals', 'fruits', 'elements','chemicals']
T=["eating apple roasted in fire and enjoying fresh air"]
test = vectorizer.transform(T)
clf = linear_model.SGDClassifier(loss='log_loss')  # loss='log' in scikit-learn < 1.1
clf.fit(X,Y)
x=clf.predict(test)
#prints: elements
In the above code, clf.predict() returns only the single best prediction for a sample.
I am interested in the top 3 predictions for a particular sample. I know that predict_proba / predict_log_proba returns the probabilities for every class label in Y, but the result has to be sorted and then matched back to the labels in Y before the top 3 can be read off.
Is there any direct and efficient way?
There is no built-in function, but what is wrong with the following?
probs = clf.predict_proba(test)
best_n = np.argsort(probs, axis=1)[-n:]
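As suggested in one of the comments, [-n:] should be changed to [:,-n:] so that the slice picks the last n columns (classes) of each row instead of the last n rows (samples):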
probs = clf.predict_proba(test)
best_n = np.argsort(probs, axis=1)[:,-n:]
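To turn those indices into labels, they can be looked up in clf.classes_, whose order matches the columns of predict_proba. The sketch below assumes that mapping and uses a hand-written probability row in place of the real classifier output, so the numbers are illustrative only; top_n_labels is a hypothetical helper name, not part of scikit-learn.

```python
import numpy as np

def top_n_labels(probs, classes, n=3):
    """Return the n most probable labels for each row, most probable first."""
    probs = np.asarray(probs)
    classes = np.asarray(classes)
    # argsort is ascending: keep the last n columns, then reverse them
    best_n = np.argsort(probs, axis=1)[:, -n:][:, ::-1]
    # fancy indexing maps each column index to its class label
    return classes[best_n]

# Hypothetical stand-ins for clf.classes_ and clf.predict_proba(test)
classes = np.array(['animals', 'chemicals', 'elements', 'fruits'])
probs = np.array([[0.05, 0.10, 0.50, 0.35]])

top3 = top_n_labels(probs, classes, n=3)
print(top3)  # [['elements' 'fruits' 'chemicals']]
```

With the fitted classifier from the question, the call would be top_n_labels(clf.predict_proba(test), clf.classes_, n=3). For a large number of classes, np.argpartition could replace the full sort, at the cost of sorting the selected n entries separately.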