I have a large pandas data frame df. It has quite a few missing values. Dropping rows or columns is not an option. Imputing medians, means, or the most frequent values is not an option either (hence imputation with pandas and/or scikit unfortunately doesn't do the trick).
I came across what seems to be a neat package called fancyimpute (you can find it here). But I have some problems with it.
Here is what I do:
# the necessary imports
import pandas as pd
import numpy as np
from fancyimpute import KNN
# df is my data frame with the missings. I keep only floats
df_numeric = df.select_dtypes(include=[np.float])
# I now run fancyimpute KNN,
# it returns a np.array which I store as a pandas dataframe
df_filled = pd.DataFrame(KNN(3).complete(df_numeric))
However, df_filled is somehow a single vector instead of the filled data frame. How do I get hold of the data frame with imputations?
I realized that fancyimpute needs a NumPy array. I hence converted df_numeric to an array using as_matrix().
# df is my data frame with the missings. I keep only floats
df_numeric = df.select_dtypes(include=[np.float]).as_matrix()
# I now run fancyimpute KNN,
# it returns a np.array which I store as a pandas dataframe
df_filled = pd.DataFrame(KNN(3).complete(df_numeric))
The output is a dataframe with the column labels gone missing. Any way to retrieve the labels?
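Add the following lines after your code. Here df_numeric must be the DataFrame returned by select_dtypes (before the as_matrix() conversion), since a NumPy array has no columns or index: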
df_filled.columns = df_numeric.columns
df_filled.index = df_numeric.index
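Putting it together, here is a minimal sketch of the whole pattern: keep the numeric DataFrame, hand only its array to the imputer, and rebuild the frame with the original labels. The helper name impute_keep_labels and the placeholder imputer are made up for illustration; the placeholder just replaces NaN with 0.0 so the snippet runs without fancyimpute, and in practice you would pass something like lambda a: KNN(3).complete(a) instead. It uses to_numpy() because as_matrix() was removed in pandas 1.0.

```python
import numpy as np
import pandas as pd


def impute_keep_labels(df_numeric, impute):
    """Run an array-based imputer and re-attach the original labels.

    `impute` is any callable that takes a 2-D NumPy array and returns a
    filled array of the same shape (hypothetical helper, not part of
    fancyimpute).
    """
    filled = impute(df_numeric.to_numpy())
    return pd.DataFrame(filled, columns=df_numeric.columns, index=df_numeric.index)


# Toy frame with a non-default index, one missing value, and a non-float column.
df = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0], "name": ["x", "y", "z"]},
    index=[10, 20, 30],
)

# Keep the DataFrame around so its labels can be reused later.
df_numeric = df.select_dtypes(include=["float64"])

# Placeholder imputer (NOT KNN): replaces NaN with 0.0 for demonstration only.
placeholder_impute = lambda arr: np.nan_to_num(arr, nan=0.0)

df_filled = impute_keep_labels(df_numeric, placeholder_impute)
print(df_filled)
```

Passing the labels to the DataFrame constructor is equivalent to assigning df_filled.columns and df_filled.index afterwards; it just does both in one step.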