Performing PCA on large sparse matrix by using sklearn

khassan picture khassan · Nov 9, 2015 · Viewed 20.7k times · Source

I am trying to apply PCA on huge sparse matrix, in the following link it says that randomizedPCA of sklearn can handle sparse matrix of scipy sparse format. Apply PCA on very large sparse matrix

However, I always get error. Can someone point out what I am doing wrong.

Input matrix 'X_train' contains numbers in float64:

>>>type(X_train)
<class 'scipy.sparse.csr.csr_matrix'>
>>>X_train.shape
(2365436, 1617899)
>>>X_train.ndim 
2
>>>X_train[0]     
<1x1617899 sparse matrix of type '<type 'numpy.float64'>'
    with 81 stored elements in Compressed Sparse Row format>

I am trying to do:

>>>from sklearn.decomposition import RandomizedPCA
>>>pca = RandomizedPCA()
>>>pca.fit(X_train)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/decomposition/pca.py", line 567, in fit
    self._fit(check_array(X))
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/utils/validation.py", line 334, in check_array
    copy, force_all_finite)
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/utils/validation.py", line 239, in _ensure_sparse_format
    raise TypeError('A sparse matrix was passed, but dense '
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

if I try to convert to dense matrix, I think I am out of memory.

>>> pca.fit(X_train.toarray())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 949, in toarray
    return self.tocoo(copy=False).toarray(order=order, out=out)
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/coo.py", line 274, in toarray
    B = self._process_toarray_args(order, out)
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/base.py", line 800, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError

Answer

Imanol Luengo picture Imanol Luengo · Nov 9, 2015

Due to the nature of the PCA, even if the input is an sparse matrix, the output is not. You can check it with a quick example:

>>> from sklearn.decomposition import TruncatedSVD
>>> from scipy import sparse as sp

Create a random sparse matrix with 0.01% of its data as non-zeros.

>>> X = sp.rand(1000, 1000, density=0.0001)

Apply PCA to it:

>>> clf = TruncatedSVD(100)
>>> Xpca = clf.fit_transform(X)

Now, check the results:

>>> type(X)
scipy.sparse.coo.coo_matrix
>>> type(Xpca)
numpy.ndarray
>>> print np.count_nonzero(Xpca), Xpca.size
95000, 100000

which suggests that 95000 of the entries are non-zero, however,

>>> np.isclose(Xpca, 0, atol=1e-15).sum(), Xpca.size
99481, 100000

99481 elements are close to 0 (<1e-15), but not 0.

Which means, in short, that for a PCA, even if the input is an sparse matrix, the output is not. Thus, if you try to extract 100,000,000 (1e8) components from your matrix, you will end up with a 1e8 x n_features (in your example 1e8 x 1617899) dense matrix, which of course, can't be hold in memory.

I'm not an expert statistician, but I believe there is currently no workaraound for this using scikit-learn, as is not a problem of scikit-learn's implementation, is just the mathematical definition of their Sparse PCA (by means of sparse SVD) which makes the result dense.

The only workaround that might work for you, is for you to start from a small amount of components, and increase it until you get a balance between the data that you can keep in memory, and the percentage of the data explained (which you can calculate as follows):

>>> clf.explained_variance_ratio_.sum()