I am using input data from here (see Section 3.1).
I am trying to reproduce their covariance matrix, eigenvalues, and eigenvectors using scikit-learn. However, I am unable to reproduce the results as presented in the data source. I've also seen this input data used elsewhere, so I can't tell whether the discrepancy is a problem with scikit-learn, with my steps, or with the data source.
import numpy as np
from sklearn.decomposition import PCA

data = np.array([[2.5,2.4],
[0.5,0.7],
[2.2,2.9],
[1.9,2.2],
[3.1,3.0],
[2.3,2.7],
[2.0,1.6],
[1.0,1.1],
[1.5,1.6],
[1.1,0.9],
])
centered_data = data-data.mean(axis=0)
pca = PCA()
pca.fit(centered_data)
print(pca.get_covariance()) #Covariance Matrix
array([[ 0.5549, 0.5539],
[ 0.5539, 0.6449]])
print(pca.explained_variance_ratio_) #Eigenvalues (normalized)
[ 0.96318131 0.03681869]
print(pca.components_) #Eigenvectors
[[-0.6778734 -0.73517866]
[ 0.73517866 -0.6778734 ]]
Surprisingly, the projections match the results from the data source described above.
print(pca.transform(centered_data)) #Projections
array([[-0.82797019, 0.17511531],
[ 1.77758033, -0.14285723],
[-0.99219749, -0.38437499],
[-0.27421042, -0.13041721],
[-1.67580142, 0.20949846],
[-0.9129491 , -0.17528244],
[ 0.09910944, 0.3498247 ],
[ 1.14457216, -0.04641726],
[ 0.43804614, -0.01776463],
[ 1.22382056, 0.16267529]])
Here is what I don't understand: why the covariance matrix, the eigenvalues, and the eigenvectors reported by scikit-learn differ from those in the data source, even though the projections match.
Correct covariance matrix of this data:
numpy.cov(data.transpose())
array([[ 0.61655556,  0.61544444],
       [ 0.61544444,  0.71655556]])
Biased (i.e. "incorrect", using the wrong normalization term, and underestimating variance in the data set) covariance matrix:
numpy.cov(data.transpose(), bias=1)
array([[ 0.5549,  0.5539],
       [ 0.5539,  0.6449]])
numpy.cov centers the data for you, so you don't need centered_data here.
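For a quick check of both points (a minimal sketch, reusing data and centered_data from the question and assuming numpy is imported): dividing the centered cross-product by n - 1 reproduces numpy.cov, dividing by n reproduces the bias=1 variant, and pre-centering makes no difference.
n = len(data)
# Unbiased estimate (default in numpy.cov): divide by n - 1.
numpy.allclose(centered_data.T.dot(centered_data) / (n - 1), numpy.cov(data.transpose()))       # True
# Biased estimate: divide by n.
numpy.allclose(centered_data.T.dot(centered_data) / n, numpy.cov(data.transpose(), bias=1))     # True
# numpy.cov centers internally, so raw and pre-centered data agree.
numpy.allclose(numpy.cov(data.transpose()), numpy.cov(centered_data.transpose()))               # True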
scikit-learn's PCA attributes are not a 1:1 match for the eigenvalues and eigenvectors.
Correct eigenvalue decomposition:
numpy.linalg.eig(numpy.cov(data.transpose()))
(array([ 0.0490834 ,  1.28402771]),
 array([[-0.73517866, -0.6778734 ],
        [ 0.6778734 , -0.73517866]]))
Using the biased estimator yields different Eigenvalues (again, underestimating variance), but the same Eigenvectors:
(array([ 0.04417506, 1.15562494]), ...
Note that the Eigenvectors are not yet sorted by the largest Eigenvalues.
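Once sorted by descending eigenvalue, the columns line up with the rows of pca.components_ up to sign, since the sign of an eigenvector is arbitrary. A minimal sketch, reusing data and pca from the question (the variable names are mine):
vals, vecs = numpy.linalg.eig(numpy.cov(data.transpose()))
order = numpy.argsort(vals)[::-1]                 # largest eigenvalue first
# Columns of vecs[:, order] correspond to rows of pca.components_, up to sign
# (compared via absolute values here).
numpy.allclose(numpy.abs(vecs[:, order].T), numpy.abs(pca.components_))   # True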
As the name of pca.explained_variance_ratio_ indicates, these are not the Eigenvalues; they are the ratios. If we take the (biased, underestimating) eigenvalues and normalize them to have a sum of 1, we get
s/sum(s)
array([ 0.03681869, 0.96318131])
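The same ratio comes out of the unbiased eigenvalues, because the constant 1/n versus 1/(n-1) factor cancels when normalizing to a sum of 1 (a small check; the variable names are mine):
s_unbiased = numpy.linalg.eigvalsh(numpy.cov(data.transpose()))
s_biased = numpy.linalg.eigvalsh(numpy.cov(data.transpose(), bias=1))
s_unbiased / s_unbiased.sum()    # array([ 0.03681869,  0.96318131])
s_biased / s_biased.sum()        # same ratio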
Also, scikit-learn's pca.transform method apparently does not apply scaling. IMHO, when using PCA it is also fairly common to scale each component to have unit variance, which clearly was not done for this output. With that scaling, the result would be (up to the order and sign of the columns, which are arbitrary for eigenvectors; I did not bother to normalize this):
s, e = numpy.linalg.eig(numpy.cov(data.transpose()))
o = numpy.argsort(s)[::-1]                       # sort indices so the largest eigenvalue comes first
(data - data.mean(axis=0)).dot(e[:, o]) / numpy.sqrt(s[o])
array([[-0.73068047, -0.79041795],
       [ 1.56870773,  0.64481466],
       [-0.87561043,  1.73495337],
       [-0.24198963,  0.58866414],
       [-1.47888824, -0.94561319],
       [-0.80567404,  0.79117236],
       [ 0.08746369, -1.57900372],
       [ 1.01008049,  0.20951358],
       [ 0.38657401,  0.08018421],
       [ 1.08001688, -0.73426743]])
(As you can see, PCA is just three lines in numpy, so you don't need a function for this.)
Why do I think this is the proper result? Because the resulting data set has the property that its covariance matrix is (except for rounding errors) the identity matrix.
Without scaling, the covariance matrix is numpy.diag(s[o]). But one may also argue that by applying the scaling, I "lost" the variance information that would otherwise have been kept.
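Both statements are easy to verify numerically (a small check, reusing s, e, and o from the three-line snippet above):
projected = (data - data.mean(axis=0)).dot(e[:, o])
# Without scaling, the covariance of the projection is diag(s[o]).
numpy.allclose(numpy.cov(projected.transpose()), numpy.diag(s[o]))                      # True
# With scaling, the covariance is (up to rounding) the identity matrix.
numpy.allclose(numpy.cov((projected / numpy.sqrt(s[o])).transpose()), numpy.eye(2))     # True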
scikit-learn is using the wrong (biased) covariance here; numpy is correct. But more often than not, it does not matter much. In the ratio above, the bias cancels out, and if you have a large data set, the difference between the naive 1/n and the unbiased 1/(n-1) normalization eventually becomes negligible. The unbiased variance estimate comes at effectively zero extra CPU cost, though, so you might as well use it.
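For a sense of scale: the two estimates differ by a factor of n/(n-1), which is about 11% for the n = 10 points here but only 0.01% for n = 10000.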