ValueError: X has 29 features per sample; expecting 84

jz451 · Aug 10, 2019 · Viewed 7.8k times

I am working on a script that uses the Lending Club API to predict whether a loan will "pay in full" or "charge off". I built the model with scikit-learn and persisted it with joblib. I run into a ValueError because the number of columns the persisted model expects differs from the number of columns in the new raw data. The mismatch is caused by creating dummy variables for the categorical variables: the model was trained on 84 columns, but in this example the new data produces only 29.

The new data needs to end up with 84 columns after making the dummy variables, but I am not sure how to get there, since only a subset of all possible values of the categorical variables 'homeOwnership', 'addrState', and 'purpose' is present in any batch of new data obtained from the API.
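To illustrate the root cause with a minimal toy example (separate from my actual script): pd.get_dummies only creates columns for the category values that actually appear in the data, so a small batch from the API produces fewer columns than the full training set did.

import pandas as pd

# The training data happens to contain three loan purposes...
train = pd.DataFrame({'purpose': ['car', 'house', 'wedding']})
# ...but a batch of new loans from the API contains only one.
new = pd.DataFrame({'purpose': ['car']})

print(pd.get_dummies(train, columns=['purpose']).columns.tolist())
# ['purpose_car', 'purpose_house', 'purpose_wedding']
print(pd.get_dummies(new, columns=['purpose']).columns.tolist())
# ['purpose_car']  -> fewer columns than the persisted model expects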

Here's the code I am testing at the moment starting at the point where the categorical variables are transformed into dummy variables and stopping at model implementation.

#......continued

# convert the "months since" fields to binary present/absent flags
df['mthsSinceLastDelinq'] = df['mthsSinceLastDelinq'].notnull().astype('int')
df['mthsSinceLastRecord'] = df['mthsSinceLastRecord'].notnull().astype('int')
df['grade_num'] = df['grade'].map({'A': 0, 'B': 1, 'C': 2, 'D': 3})
df['emp_length_num'] = df['empLength']
df = pd.get_dummies(df, columns=['homeOwnership', 'addrState', 'purpose'])
# df = pd.get_dummies(df, columns=['home_ownership', 'addr_state', 'verification_status', 'purpose'])

# step 3.5 transform data before making predictions

df.drop(['id', 'grade', 'empLength', 'isIncV'], axis=1, inplace=True)
dfbcd = df[df['grade_num'] != 0]  # keep only grades B, C and D
scaler = StandardScaler()
x_scbcd = scaler.fit_transform(dfbcd)

# step 4 predicting

lrbcd_test = load('lrbcd_test.joblib')
ypredbcdfinal = lrbcd_test.predict(x_scbcd)

Here's the error message

ValueError                                Traceback (most recent call last)
<ipython-input-239-c99611b2e48a> in <module>
     11 # change name of model and file name
     12 lrbcd_test = load('lrbcd_test.joblib')
---> 13 ypredbcdfinal = lrbcd_test.predict(x_scbcd)
     14 
     15     #add model

~\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in predict(self, X)
    287             Predicted class label per sample.
    288         """
--> 289         scores = self.decision_function(X)
    290         if len(scores.shape) == 1:
    291             indices = (scores > 0).astype(np.int)

~\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in decision_function(self, X)
    268         if X.shape[1] != n_features:
    269             raise ValueError("X has %d features per sample; expecting %d"
--> 270                              % (X.shape[1], n_features))
    271 
    272         scores = safe_sparse_dot(X, self.coef_.T,

ValueError: X has 29 features per sample; expecting 84

Answer

Amit · Aug 12, 2019

Your new data should have the exact same columns as the data you used to train and persist the original model. If the new data contains fewer unique values of the categorical variables, manually add columns for the missing values after calling pd.get_dummies() and set them to zero for all rows.

The model will only work when it receives the required number of columns. If pd.get_dummies does not create all of those columns on the new data, you have to create them yourself, as in the sketch below.
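As a rough sketch (assuming train_columns is a list of the 84 column names saved at training time; that name is not in the original code), the manual fix could look like this:

# train_columns: assumed list of the 84 feature/dummy column names
# that the model was trained on, saved when the model was built.
df = pd.get_dummies(df, columns=['homeOwnership', 'addrState', 'purpose'])

# Add any dummy column that is missing from the new data, filled with 0.
for col in train_columns:
    if col not in df.columns:
        df[col] = 0

# Reorder the columns to match the training layout.
df = df[train_columns]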

Edit

If you want to automatically insert the missing columns after the pd.get_dummies() step, you can use the following approach. Assuming that df_newdata is the dataframe you get after applying pd.get_dummies() to the new dataset, and df_olddata is the dataframe you got when you applied pd.get_dummies() to the older dataset (the one used for training), you can simply do this:

df_newdata = df_newdata.reindex(labels=df_olddata.columns, axis=1)

This will automatically create the missing columns in df_newdata (in comparison to df_olddata) and set their values to NaN for all rows. Since the model cannot handle NaN inputs, fill those columns with 0 afterwards (see the sketch below). Your new dataframe now has the exact same columns as the original dataframe.
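Putting it together, here is a minimal sketch assuming you also persist the training columns with joblib (the file name train_columns.joblib is illustrative, not from the original code):

from joblib import dump, load

# At training time (assumed step): save the column order next to the model.
dump(df_olddata.columns.tolist(), 'train_columns.joblib')

# At prediction time: align the new data to the saved training columns.
train_columns = load('train_columns.joblib')
df_newdata = df_newdata.reindex(labels=train_columns, axis=1)

# reindex fills the newly created columns with NaN, which the model
# cannot handle, so replace them with 0 before calling predict.
df_newdata = df_newdata.fillna(0)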

Hope this helps