Predicting missing values with scikit-learn's Imputer module

xennygrimmato picture xennygrimmato · Jul 29, 2014 · Viewed 44.1k times · Source

I am writing a very basic program to predict missing values in a dataset using scikit-learn's Imputer class.

I have made a NumPy array, created an Imputer object with strategy='mean' and performed fit_transform() on the NumPy array.

When I print the array after performing fit_transform(), the 'Nan's remain, and I dont get any prediction.

What am I doing wrong here? How do I go about predicting the missing values?

import numpy as np
from sklearn.preprocessing import Imputer

X = np.array([[23.56],[53.45],['NaN'],[44.44],[77.78],['NaN'],[234.44],[11.33],[79.87]])

print X

imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit_transform(X)

print X

Answer

jonrsharpe picture jonrsharpe · Jul 29, 2014

Per the documentation, sklearn.preprocessing.Imputer.fit_transform returns a new array, it doesn't alter the argument array. The minimal fix is therefore:

X = imp.fit_transform(X)