I am using PCA to reduce the dimensionality of a N-dimensional dataset, but I want to build in robustness to large outliers, so I've been looking into Robust PCA codes.
For traditional PCA, I'm using python's sklearn.decomposition.PCA which nicely returns the principal components as vectors, onto which I can then project my data (to be clear, I've also coded my own versions using SVD so I know how the method works). I found a few pre-coded RPCA python codes out there (like https://github.com/dganguli/robust-pca and https://github.com/jkarnows/rpcaADMM).
The 1st code is based on the Candes et al. (2009) method, and returns low rank L and sparse S matrices for a dataset D. The 2nd code uses the ADMM method of matrix decomposition (Parikh, N., & Boyd, S. 2013) and returns X_1, X_2, X_3 matrices. I must admit, I'm having a very hard time figuring out how to connect these to the principal axes that are returned by a standard PCM algorithm. Can anyone provide any guidance?
Specifically, in one dataset X, I have a cloud of N 3-D points. I run it through PCA:
pca=sklean.decompose.PCA(n_components=3)
pca.fit(X)
comps=pca.components_
and these 3 components are 3-D vectors define the new basis onto which I project all my points. With Robust PCA, I get matrices L+S=X. Does one then run pca.fit(L)? I would have thought that RPCA would have given me back the eigenvectors but have internal steps to throw out outliers as part of building the covariance matrix or performing SVD.
Maybe what I think of as "Robust PCA" isn't how other people are using/coding it?
The robust-pca
code factors the data matrix D
into two matrices, L
and S
which are "low-rank" and "sparse" matrices (see the paper for details). L
is what's mostly constant between the various observations, while S
is what varies. Figures 2 and 3 in the paper give a really nice example from a couple of security cameras, picking out the static background (L
) and variability such as passing people (S
).
If you just want the eigenvectors, treat the S
as junk (the "large outliers" you're wanting to clip out) and do an eigenanalysis on the L
matrix.
Here's an example using the robust-pca
code:
L, S = RPCA(data).fit()
rcomp, revals, revecs = pca(L)
print("Normalised robust eigenvalues: %s" % (revals/np.sum(revals),))
Here, the pca
function is:
def pca(data, numComponents=None):
"""Principal Components Analysis
From: http://stackoverflow.com/a/13224592/834250
Parameters
----------
data : `numpy.ndarray`
numpy array of data to analyse
numComponents : `int`
number of principal components to use
Returns
-------
comps : `numpy.ndarray`
Principal components
evals : `numpy.ndarray`
Eigenvalues
evecs : `numpy.ndarray`
Eigenvectors
"""
m, n = data.shape
data -= data.mean(axis=0)
R = np.cov(data, rowvar=False)
# use 'eigh' rather than 'eig' since R is symmetric,
# the performance gain is substantial
evals, evecs = np.linalg.eigh(R)
idx = np.argsort(evals)[::-1]
evecs = evecs[:,idx]
evals = evals[idx]
if numComponents is not None:
evecs = evecs[:, :numComponents]
# carry out the transformation on the data using eigenvectors
# and return the re-scaled data, eigenvalues, and eigenvectors
return np.dot(evecs.T, data.T).T, evals, evecs