Pandas sparse dataFrame to sparse matrix, without generating a dense matrix in memory

Jake0x32 picture Jake0x32 · Jun 27, 2015 · Viewed 10.2k times · Source

Is there a way to convert from a pandas.SparseDataFrame to scipy.sparse.csr_matrix, without generating a dense matrix in memory?

scipy.sparse.csr_matrix(df.values)

doesn't work as it generates a dense matrix which is cast to the csr_matrix.

Thanks in advance!

Answer

T.C. Proctor picture T.C. Proctor · Jul 2, 2016

Pandas 0.20.0+:

As of pandas version 0.20.0, released May 5, 2017, there is a one-liner for this:

from scipy import sparse


def sparse_df_to_csr(df):
    return sparse.csr_matrix(df.to_coo())

This uses the new to_coo() method.

Earlier Versions:

Building on Victor May's answer, here's a slightly faster implementation, but it only works if the entire SparseDataFrame is sparse with all BlockIndex (note: if it was created with get_dummies, this will be the case).

Edit: I modified this so it will work with a non-zero fill value. CSR has no native non-zero fill value, so you will have to record it externally.

import numpy as np
import pandas as pd
from scipy import sparse

def sparse_BlockIndex_df_to_csr(df):
    columns = df.columns
    zipped_data = zip(*[(df[col].sp_values - df[col].fill_value,
                         df[col].sp_index.to_int_index().indices)
                        for col in columns])
    data, rows = map(list, zipped_data)
    cols = [np.ones_like(a)*i for (i,a) in enumerate(data)]
    data_f = np.concatenate(data)
    rows_f = np.concatenate(rows)
    cols_f = np.concatenate(cols)
    arr = sparse.coo_matrix((data_f, (rows_f, cols_f)),
                            df.shape, dtype=np.float64)
    return arr.tocsr()