I'm working on turning a list of records with two columns (A and B) into a matrix representation. I have been using the pivot function in pandas, but the result ends up being fairly large. Does pandas support pivoting into a sparse format? I know I can pivot it and then convert the result into some kind of sparse representation, but that isn't as elegant as I would like. My end goal is to use it as the input for a predictive model.
Alternatively, is there some kind of sparse pivot capability outside of pandas?
Edit: here is an example of a non-sparse pivot:
import pandas as pd
  person thing  count
0     me     a      1
1    you     a      1
2    him     b      1
3    you     c      1
4    him     d      1
5     me     d      1
thing    a    b    c    d
him    NaN    1  NaN    1
me       1  NaN  NaN    1
you      1  NaN    1  NaN
This creates a matrix that could contain all possible combinations of persons and things, but it is not sparse.
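For anyone who wants to reproduce the output above, here is a minimal sketch. The way `frame` is constructed is an assumption, since only its printed form is shown; the variable names `dense` and `n_missing` are just for illustration.

```python
import pandas as pd

# Assumed construction of the example frame; only its printed form appears above.
frame = pd.DataFrame(
    {
        "person": ["me", "you", "him", "you", "him", "me"],
        "thing": ["a", "a", "b", "c", "d", "d"],
        "count": [1, 1, 1, 1, 1, 1],
    }
)

# Dense pivot: one cell per (person, thing) pair, NaN where no record exists.
dense = frame.pivot(index="person", columns="thing", values="count")

# Count how many cells of the dense result carry no information.
n_missing = int(dense.isna().sum().sum())
print(dense)
print(n_missing, "of", dense.size, "cells are NaN")
```

Even in this tiny example half of the cells are NaN, and that fraction typically grows as the number of distinct persons and things grows.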
Sparse matrices take up less space because they can imply things like NaN or 0. If I have a very large data set, this pivoting function can generate a matrix that should be sparse due to the large number of NaNs or 0s. I was hoping that I could save a lot of space/memory by generating something that was sparse right off the bat rather than creating a dense matrix and then converting it to sparse.
Here is a method that creates a sparse scipy matrix based on data and indices of person and thing. person_u and thing_u are lists representing the unique entries for the rows and columns of the pivot you want to create. Note: this assumes that your count column already has the value you want in it.
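import pandas as pd
from scipy.sparse import csr_matrix
# Sorted unique labels define the row and column order of the pivot.
person_u = sorted(frame.person.unique())
thing_u = sorted(frame.thing.unique())
data = frame['count'].tolist()
# Map each label to its integer position in person_u / thing_u.
# (astype('category', categories=...) no longer works in current pandas; use CategoricalDtype.)
row = frame.person.astype(pd.CategoricalDtype(categories=person_u)).cat.codes
col = frame.thing.astype(pd.CategoricalDtype(categories=thing_u)).cat.codes
sparse_matrix = csr_matrix((data, (row, col)), shape=(len(person_u), len(thing_u)))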
>>> sparse_matrix
<3x4 sparse matrix of type '<type 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
>>> sparse_matrix.todense()
matrix([[0, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 0, 1, 0]])
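The note about the count column matters when the same person/thing pair shows up in more than one record. As far as I know, scipy sums duplicate coordinates when it builds a CSR matrix from (data, (row, col)) triplets, but aggregating first makes the intent explicit. A minimal sketch, assuming summing is the aggregation you want; the `records` frame with a repeated pair is made up for illustration:

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical records in which ("me", "a") appears twice.
records = pd.DataFrame(
    {
        "person": ["me", "me", "you"],
        "thing": ["a", "a", "b"],
        "count": [1, 1, 1],
    }
)

# Aggregate explicitly so each (person, thing) pair occurs exactly once.
agg = records.groupby(["person", "thing"], as_index=False)["count"].sum()

person_u = sorted(agg["person"].unique())
thing_u = sorted(agg["thing"].unique())

# Integer positions of each label within the sorted unique lists.
row = pd.Categorical(agg["person"], categories=person_u).codes
col = pd.Categorical(agg["thing"], categories=thing_u).codes

m = csr_matrix(
    (agg["count"].to_numpy(), (row, col)),
    shape=(len(person_u), len(thing_u)),
)
print(m.toarray())
```

If you need a different aggregation (max, mean, and so on), swap it in at the groupby step before building the matrix.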
Based on your original question, the scipy sparse matrix should be sufficient for your needs, but should you wish to have a sparse dataframe you can do the following:
# Requires pandas < 1.0; SparseDataFrame and SparseSeries were removed in pandas 1.0.
dfs = pd.SparseDataFrame(
    [pd.SparseSeries(sparse_matrix[i].toarray().ravel(), fill_value=0) for i in range(sparse_matrix.shape[0])],
    index=person_u, columns=thing_u, default_fill_value=0)
>>> dfs
     a  b  c  d
him  0  1  0  1
me   1  0  0  1
you  1  0  1  0
>>> type(dfs)
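On recent pandas versions, one way to get a sparse-backed frame from a scipy matrix is `pd.DataFrame.sparse.from_spmatrix`. This is a sketch rather than a drop-in replacement for the snippet above: the result is a regular DataFrame whose columns have a sparse dtype. The matrix is rebuilt by hand here (the `row` and `col` lists are written out from the example records) so the block runs on its own.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Same 3 x 4 matrix as in the example, rebuilt so this snippet is self-contained.
person_u = ["him", "me", "you"]
thing_u = ["a", "b", "c", "d"]
row = [1, 2, 0, 2, 0, 1]  # positions of me, you, him, you, him, me in person_u
col = [0, 0, 1, 2, 3, 3]  # positions of a, a, b, c, d, d in thing_u
sparse_matrix = csr_matrix(([1] * 6, (row, col)), shape=(3, 4))

# Build a DataFrame with sparse columns directly from the scipy matrix.
dfs = pd.DataFrame.sparse.from_spmatrix(
    sparse_matrix, index=person_u, columns=thing_u
)
print(dfs)
print(dfs.dtypes)
```

Going back to scipy is possible with `dfs.sparse.to_coo()`, which may be handy if the model you are feeding accepts scipy sparse input directly.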