How to use scikit-learn PCA for features reduction and know which features are discarded

gc5 picture gc5 · Apr 25, 2014 · Viewed 33k times · Source

I am trying to run a PCA on a matrix of dimensions m x n where m is the number of features and n the number of samples.

Suppose I want to preserve the nf features with the maximum variance. With scikit-learn I am able to do it in this way:

from sklearn.decomposition import PCA

nf = 100
pca = PCA(n_components=nf)
# X is the matrix transposed (n samples on the rows, m features on the columns)
pca.fit(X)

X_new = pca.transform(X)

Now, I get a new matrix X_new that has a shape of n x nf. Is it possible to know which features have been discarded or the retained ones?

Thanks

Answer

eickenberg picture eickenberg · Apr 25, 2014

The features that your PCA object has determined during fitting are in pca.components_. The vector space orthogonal to the one spanned by pca.components_ is discarded.

Please note that PCA does not "discard" or "retain" any of your pre-defined features (encoded by the columns you specify). It mixes all of them (by weighted sums) to find orthogonal directions of maximum variance.

If this is not the behaviour you are looking for, then PCA dimensionality reduction is not the way to go. For some simple general feature selection methods, you can take a look at sklearn.feature_selection