The SciPy Sparse Matrix tutorial is very good -- but it actually leaves the section on slicing un(der)developed (still in outline form -- see section: "Handling Sparse Matrices").
I will try and update the tutorial, once this question is answered.
I have a large sparse matrix -- currently in dok_matrix format.
import numpy as np
from scipy import sparse
M = sparse.dok_matrix((10**6, 10**6))
For various methods I want to be able to slice columns and for others I want to slice rows. Ideally I would use advanced-indexing (i.e. a boolean vector, bool_vect
) with which to slice a sparse matrix M
-- as in:
bool_vect = np.arange(10**6)%2 # every even index
out = M[bool_vect,:] # Want to select every even row
or
out = M[:,bool_vect] # Want to select every even column
First off, dok_matrices do not support this -- but I think it works (slowly) if I first cast to lil_matrices, via sparse.lil_matrix(M)
As far as I can gather from the tutorial -- to slice columns I want to use CSC and to slice rows I want to slice CSR. So does that mean I should cast the matrix M
via:
M.tocsc()[:,bool_vect]
or
M.tocsr()[bool_vect,:]
I am kinda guessing here and my code is slow because of it. Any help from someone who understands how this works would be appreciated. Thanks in advance.
If it turns out I should not be indexing my matrix with a boolean array, but rather a list of integers (indices) -- that is also something I would be happy to find out. Whichever is more efficient.
Finally -- this is a big matrix, so bonus points if this can happen in place / with broadcasting.
Ok, so I'm pretty sure the "right" way to do this is: if you are slicing columns, use tocsc() and slice using a list/array of integers. Boolean vectors does not seem to do the trick with sparse matrices -- the way it does with ndarrays in numpy. Which means the answer is.
indices = np.where(bool_vect)[0]
out1 = M.tocsc()[:,indices]
out2 = M.tocsr()[indices,:]
But question: is this the best way? Is this in place?
In practice this does seem to be happening in place -- and it is much faster than prior attempts (using lil_matrix).