kNN imputation of categorical variables in Python

KINNI · Apr 20, 2017 · Viewed 9.8k times

I am trying to use KNN from the fancyimpute module on a dataset. I was able to impute the continuous variables of the dataset using the code below:

from fancyimpute import KNN
knn_impute2 = KNN(k=3).complete(train[['LotArea', 'LotFrontage']])

It yields the desired result. (Image: what the original dataset looks like and how it changed after kNN imputation.)

I tried to run the same code on categorical columns and I get this error:

could not convert string to float: 'female'

Here is the code I used (I am trying to use Imputer):

from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
imp.fit(df['sex'])
print(imp.transform(df['sex']))

What am I doing wrong?

To recap, I want to use kNN imputation on this dataset to impute the sex column. The dataset is below.

(Image: the dataset I want to impute using kNN imputation with a k value of 2.)

How can I do that with knnimpute, or do I need to write my own function? If so, can anyone help me? Thanks.

Answer

KINNI · Apr 26, 2017

I was able to impute the categorical variables using the steps listed below. I would gladly welcome corrections for anything I have missed, or pointers to a program that can perform these steps automatically.

Step 1: Subset all the object-dtype columns into another container.

Step 2: Change np.NaN into an object value, say None. Now the container is made up of only object data types.

Step 3: Convert the entire container to categorical dtype.

Step 4: Encode the dataset (I am using .cat.codes).

Step 5: Change the encoded value of None back into np.NaN.

Step 6: Use KNN (from fancyimpute) to impute the missing values.

Step 7: Re-map the encoded dataset to its original category names.
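
The steps above can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the exact code used for this answer: the toy DataFrame and its column names are hypothetical, and scikit-learn's KNNImputer is swapped in for fancyimpute's KNN so that the example depends only on pandas, NumPy, and scikit-learn. Note that pandas already leaves NaN out of the categories and encodes it as -1 in .cat.codes, so in this sketch the None conversion reduces to turning -1 back into NaN.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer  # stand-in for fancyimpute's KNN

# Hypothetical toy data; names and values are made up for illustration.
df = pd.DataFrame({
    "sex": ["male", "female", np.nan, "female", "male", np.nan],
    "age": [40.0, 22.0, 24.0, 21.0, 38.0, 41.0],
})

# Convert the object column to categorical and remember the category labels.
sex_cat = df["sex"].astype("category")
categories = np.asarray(sex_cat.cat.categories)

# Encode with .cat.codes; missing values are encoded as -1,
# so change them back into NaN before imputing.
codes = sex_cat.cat.codes.astype("float64")
codes[codes < 0] = np.nan

# Impute the encoded column with kNN (k = 2), using age as the other feature.
encoded = pd.DataFrame({"sex": codes, "age": df["age"]})
imputed = KNNImputer(n_neighbors=2).fit_transform(encoded)

# Round the averaged codes to the nearest integer and map back to labels.
sex_codes = np.rint(imputed[:, 0]).astype(int)
df["sex_imputed"] = categories[sex_codes]

print(df)
```

One caveat with this approach: kNN averages the integer codes of the neighbours, which treats the categories as if they were ordered numbers. For a two-level column such as sex, rounding the average amounts to a majority vote, but with more than two categories the rounded value can land on a category that none of the neighbours actually has.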