I'm working with scikit-learn on a text classification experiment. Now I would like to get the names of the best-performing selected features. I tried some of the answers to similar questions, but nothing works. The last lines of code are an example of what I tried. For example, when I print feature_names, I get this error: sklearn.exceptions.NotFittedError: This SelectKBest instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.
Any solutions?
scaler = StandardScaler(with_mean=False)
enc = LabelEncoder()
y = enc.fit_transform(labels)
feat_sel = SelectKBest(mutual_info_classif, k=200)
clf = linear_model.LogisticRegression()
pipe = Pipeline([('vectorizer', DictVectorizer()),
('scaler', StandardScaler(with_mean=False)),
('mutual_info', feat_sel),
('logistregress', clf)])
feature_names = pipe.named_steps['mutual_info']
X.columns[features.transform(np.arange(len(X.columns)))]
You first have to fit the pipeline. Only after that can you query the fitted selector for the feature names:
Solution
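from sklearn import linear_model
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

scaler = StandardScaler(with_mean=False)
# Encode the string labels as integers
enc = LabelEncoder()
y = enc.fit_transform(labels)
feat_sel = SelectKBest(mutual_info_classif, k=200)
clf = linear_model.LogisticRegression()
pipe = Pipeline([('vectorizer', DictVectorizer()),
                 ('scaler', scaler),
                 ('mutual_info', feat_sel),
                 ('logistregress', clf)])
# Now fit the pipeline using your data
pipe.fit(X, y)
# Now the steps in pipe.named_steps are fitted and can be queried
selector = pipe.named_steps['mutual_info']
# Boolean mask over the vectorizer's output columns: True where a feature was kept
mask = selector.get_support()
# X is a list of dicts, so the column names come from the DictVectorizer
# (get_feature_names_out requires scikit-learn >= 1.0)
feature_names = pipe.named_steps['vectorizer'].get_feature_names_out()[mask]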
General information
The pipeline example in the scikit-learn documentation contains this line:
anova_svm.set_params(anova__k=10, svc__C=.1).fit(X, y)
This sets some initial parameters (the k parameter for the anova step and the C parameter for the svc step) and then calls fit(X, y) to fit the pipeline.
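The step__parameter naming can be tried out with a small sketch. The pipeline below is an assumption modelled on that documentation line (an "anova" step followed by an "svc" step), and the data comes from make_classification rather than from the question.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic data, used only to have something to fit on.
X, y = make_classification(
    n_samples=100, n_features=20, n_informative=5, random_state=0
)

anova_svm = Pipeline([
    ("anova", SelectKBest(f_classif, k=5)),
    ("svc", SVC(kernel="linear")),
])

# "<step name>__<parameter>" addresses a parameter of one step.
# set_params returns the pipeline itself, so fit can be chained.
anova_svm.set_params(anova__k=10, svc__C=0.1).fit(X, y)

selector = anova_svm.named_steps["anova"]
print(selector.k)                    # k after set_params
print(selector.get_support().sum())  # number of features actually kept
```

Because set_params only changes the configuration, it is the chained fit call that makes the steps usable afterwards.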
EDIT:
For the new error: since your X is a list of dictionaries, it has no columns attribute. One way to get the columns you want is to build a pandas DataFrame from X:
from pandas import DataFrame
X = [{'age': 10, 'name': 'Tom'}, {'age': 5, 'name': 'Mark'}]
df = DataFrame(X)
len(df.columns)
result:
2
Hope this helps