When choosing the number of principal components (k), we pick the smallest k such that, for example, 99% of the variance is retained.
However, in Python's scikit-learn, I am not 100% sure whether pca.explained_variance_ratio_ = 0.99 means "99% of the variance is retained". Could anyone clarify? Thanks.
Yes, you are nearly right. The pca.explained_variance_ratio_ attribute holds a vector with the fraction of the total variance explained by each dimension (principal component). Thus pca.explained_variance_ratio_[i] gives the fraction of variance explained solely by the (i+1)-th dimension.
You probably want to do pca.explained_variance_ratio_.cumsum(). That will return a vector x such that x[i] is the cumulative variance explained by the first i+1 dimensions.
import numpy as np
from sklearn.decomposition import PCA
np.random.seed(0)
my_matrix = np.random.randn(20, 5)
my_model = PCA(n_components=5)
my_model.fit_transform(my_matrix)
print(my_model.explained_variance_)
print(my_model.explained_variance_ratio_)
print(my_model.explained_variance_ratio_.cumsum())
[ 1.50756565 1.29374452 0.97042041 0.61712667 0.31529082]
[ 0.32047581 0.27502207 0.20629036 0.13118776 0.067024 ]
[ 0.32047581 0.59549787 0.80178824 0.932976 1. ]
So in my random toy data, if I picked k=4, I would retain 93.3% of the variance.
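To go from the cumulative vector to the smallest k for a given target, something along these lines should work. This is only a sketch: smallest_k is a hypothetical helper name (not part of scikit-learn), and the ratios are copied from the toy output above rather than recomputed.

```python
import numpy as np

def smallest_k(ratios, threshold):
    """Return the smallest k whose first k ratios sum to at least `threshold`.

    `ratios` is assumed to look like pca.explained_variance_ratio_:
    non-negative values summing to (about) 1.
    """
    cumulative = np.cumsum(ratios)
    # First index where the cumulative sum reaches the threshold (0-based).
    idx = np.searchsorted(cumulative, threshold)
    # Convert to a count, and clip in case rounding keeps the total just below the threshold.
    return int(min(idx + 1, len(cumulative)))

# Ratios taken from the toy example output above.
ratios = np.array([0.32047581, 0.27502207, 0.20629036, 0.13118776, 0.067024])

print(smallest_k(ratios, 0.90))  # 4
print(smallest_k(ratios, 0.99))  # 5
```

With this toy data, reaching 99% needs all 5 dimensions, because the first 4 only reach 93.3%. scikit-learn can also make this choice itself: passing a float between 0 and 1 as n_components (with svd_solver='full') keeps just enough components to explain that fraction of the variance.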