I'm looking to perform feature selection with a multi-label dataset using sklearn. I want to get the final set of features across labels, which I will then use in another machine learning package. I was planning to use the method I saw here, which selects relevant features for each label separately.
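from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.multiclass import OneVsRestClassifier

# Per-label pipeline: keep the 1000 best features by chi2, then fit a linear SVM
clf = Pipeline([('chi2', SelectKBest(chi2, k=1000)),
                ('svm', LinearSVC())])
# One copy of the pipeline is fitted per label
multi_clf = OneVsRestClassifier(clf)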
I then plan to extract the indices of the included features, per label, using this:
selected_features = []
for i in multi_clf.estimators_:
    selected_features += list(i.named_steps["chi2"].get_support(indices=True))
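For reference, here is a minimal end-to-end sketch of the two steps above on toy data. The random count matrix, the two synthetic labels, and k=3 are assumptions made only so the example runs; they are not part of my real dataset.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for vectorized text: 200 samples, 20 non-negative count features.
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(200, 20))
# Two synthetic labels (assumption): each one is driven by a single feature.
Y = np.column_stack([X[:, 0] > 1, X[:, 5] > 1]).astype(int)

clf = Pipeline([("chi2", SelectKBest(chi2, k=3)),
                ("svm", LinearSVC())])
# estimators_ only exists after fit(); one fitted pipeline per label column.
multi_clf = OneVsRestClassifier(clf).fit(X, Y)

# Indices of the features each label's selector kept.
per_label = [est.named_steps["chi2"].get_support(indices=True)
             for est in multi_clf.estimators_]
# Union of the selected features across labels.
union = np.unique(np.concatenate(per_label))
```

Here `per_label` holds one index array per label, and `union` is the "every unique feature" option mentioned below.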
Now, my question is, how do I choose which selected features to include in my final model? I could use every unique feature (which would include features that were only relevant for one label), or I could do something to select features that were relevant for more labels.
My initial idea is to create a histogram of the number of labels a given feature was selected for, and to identify a threshold based on visual inspection. My concern is that this method is subjective. Is there a more principled way of performing feature selection for multilabel datasets using sklearn?
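To make the histogram idea concrete, this is roughly the counting step I have in mind. The helper name `label_counts_per_feature`, the toy index lists, and the cutoff of 2 labels are hypothetical and only there for illustration.

```python
import numpy as np

def label_counts_per_feature(per_label_indices, n_features):
    """Return, for each feature index, the number of labels that selected it."""
    counts = np.zeros(n_features, dtype=int)
    for idx in per_label_indices:
        # np.unique guards against an index appearing twice for the same label.
        counts[np.unique(np.asarray(idx, dtype=int))] += 1
    return counts

# Toy example: three labels selecting from six features.
per_label = [[0, 2, 3], [2, 3], [3, 5]]
counts = label_counts_per_feature(per_label, n_features=6)

# Hypothetical cutoff: keep features selected for at least 2 labels.
keep = np.flatnonzero(counts >= 2)
```

The `counts` array is what I would plot as a histogram, and the cutoff in the last line is the part that feels subjective.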
According to the conclusions in this paper:
[...] rank features according to the average or the maximum Chi-squared score across all labels, led to most of the best classifiers while using less features.
Then, in order to select a good subset of features you just need to do (something like) this:
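import numpy as np
from sklearn.feature_selection import chi2, SelectKBest

scores = []
for label in Y.columns:
    # k='all' keeps every feature; only the per-feature chi2 scores are needed
    selector = SelectKBest(chi2, k='all')
    selector.fit(X, Y[label])
    scores.append(selector.scores_)
scores = np.array(scores)  # shape: (n_labels, n_features)

# MeanCS: keep features whose average score across labels exceeds the threshold
selected_mean = np.mean(scores, axis=0) > threshold
# MaxCS: keep features whose maximum score across labels exceeds the threshold
selected_max = np.max(scores, axis=0) > threshold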
Note: the code above assumes that X is the output of some text vectorizer (the vectorized version of the texts) and that Y is a pandas DataFrame with one column per label (so that Y[label] selects a single label's column). The threshold variable must also be fixed beforehand.
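If picking a threshold by hand is awkward, one possible variant is to rank the features by the aggregated score and keep the top k. The sketch below is only an illustration under assumptions: the toy count matrix, the two synthetic labels, the NumPy label array (instead of a DataFrame), and the budget k=5 are all made up, and it calls `chi2` directly, which returns the scores and the p-values.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Toy stand-in for a vectorizer's output: 200 documents, 20 count features.
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(200, 20))
# Toy label matrix, one column per label (assumption: each label depends on one feature).
Y = np.column_stack([X[:, 0] > 1, X[:, 5] > 1]).astype(int)

# chi2(X, y) returns (scores, p_values); stack the scores into (n_labels, n_features).
scores = np.vstack([chi2(X, y)[0] for y in Y.T])

mean_cs = scores.mean(axis=0)  # MeanCS: average score across labels
max_cs = scores.max(axis=0)    # MaxCS: best score across labels

k = 5  # hypothetical feature budget
top_by_mean = np.argsort(mean_cs)[::-1][:k]
top_by_max = np.argsort(max_cs)[::-1][:k]

# Reduced matrix to hand to another package, columns kept in original order.
X_reduced = X[:, np.sort(top_by_max)]
```

With a DataFrame Y you would iterate over Y.columns instead of Y.T. Whether k or a threshold is the better knob depends on the downstream model, so treat both as tunable.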