I am working on a script that uses the Lending Club API to predict whether a loan will "pay in full" or "charge off". I build the model with scikit-learn and persist it with joblib. When I score new raw data I run into a ValueError because the number of columns the persisted model expects differs from the number of columns in the new data. The mismatch comes from creating dummy variables for the categorical features: the model was trained on 84 columns, but in this example the new data produces only 29.
The new data needs the same 84 columns after the dummy-variable step, but I am not sure how to get there, since only a subset of all possible values for the categorical variables 'homeOwnership', 'addrState', and 'purpose' is present in any batch of new data obtained from the API.
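To illustrate the mismatch with a toy example (made-up values, not the real API fields), the same pd.get_dummies call yields different column sets depending on which category values happen to be present:

import pandas as pd

# a training batch happens to contain three purposes
train = pd.DataFrame({'purpose': ['car', 'house', 'medical']})
# a new API batch may only contain one
new = pd.DataFrame({'purpose': ['car']})

print(pd.get_dummies(train, columns=['purpose']).columns.tolist())
# ['purpose_car', 'purpose_house', 'purpose_medical']
print(pd.get_dummies(new, columns=['purpose']).columns.tolist())
# ['purpose_car']  <- fewer columns than the model expects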
Here's the code I am testing at the moment, starting at the point where the categorical variables are transformed into dummy variables and ending at the prediction step.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from joblib import load

# ......continued
# flag whether a delinquency / public record has ever been observed
df['mthsSinceLastDelinq'] = df['mthsSinceLastDelinq'].notnull().astype('int')
df['mthsSinceLastRecord'] = df['mthsSinceLastRecord'].notnull().astype('int')
df['grade_num'] = df['grade'].map({'A': 0, 'B': 1, 'C': 2, 'D': 3})
df['emp_length_num'] = df['empLength']
df = pd.get_dummies(df, columns=['homeOwnership', 'addrState', 'purpose'])
# df = pd.get_dummies(df, columns=['home_ownership', 'addr_state', 'verification_status', 'purpose'])
# step 3.5 transform data before making predictions
df.drop(['id', 'grade', 'empLength', 'isIncV'], axis=1, inplace=True)
dfbcd = df[df['grade_num'] != 0]  # keep grades B, C and D only
scaler = StandardScaler()
x_scbcd = scaler.fit_transform(dfbcd)
# step 4 predicting
lrbcd_test = load('lrbcd_test.joblib')
ypredbcdfinal = lrbcd_test.predict(x_scbcd)
Here's the error message:
ValueError Traceback (most recent call last)
<ipython-input-239-c99611b2e48a> in <module>
11 # change name of model and file name
12 lrbcd_test = load('lrbcd_test.joblib')
---> 13 ypredbcdfinal = lrbcd_test.predict(x_scbcd)
14
15 #add model
~\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in predict(self, X)
287 Predicted class label per sample.
288 """
--> 289 scores = self.decision_function(X)
290 if len(scores.shape) == 1:
291 indices = (scores > 0).astype(np.int)
~\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in decision_function(self, X)
268 if X.shape[1] != n_features:
269 raise ValueError("X has %d features per sample; expecting %d"
--> 270 % (X.shape[1], n_features))
271
272 scores = safe_sparse_dot(X, self.coef_.T,
ValueError: X has 29 features per sample; expecting 84
Your new data must have exactly the same columns as the data you used to train and persist the original model. If fewer unique values of the categorical variables appear in the new data, manually add columns for the missing levels after calling pd.get_dummies() and set them to zero for every row.
The model will only work when it receives the expected number of columns. If pd.get_dummies cannot create all of those columns from the new data, you have to create them yourself, as in the sketch below.
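For example, a minimal sketch (train_columns is an assumed, saved list of the 84 feature names the model was fitted on, not something from your code):

# train_columns: the exact list of columns used at training time (assumed persisted)
for col in train_columns:
    if col not in df.columns:
        df[col] = 0          # this category simply never occurred in the new batch
df = df[train_columns]       # also restores the original column order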
Edit
If you want to insert the missing columns automatically after the pd.get_dummies() step, you can use the following approach. Assuming that df_newdata is the dataframe you get after applying pd.get_dummies() to the new dataset, and df_olddata is the dataframe you got when you applied pd.get_dummies() to the older dataset (the one used for training), you can simply do this:
df_newdata = df_newdata.reindex(columns=df_olddata.columns, fill_value=0)
This will automatically create the missing columns in df_newdata (in comparison to df_olddata) and fill them with 0 for all rows, which is the right value here, since a missing dummy column just means that category never occurred in the new batch. (Without fill_value=0, reindex would fill the new columns with NaN, which the model cannot handle.) So now your new dataframe has the same exact columns as the original dataframe had.
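Putting it together, a minimal sketch (file and variable names are illustrative): persist the training-time column list alongside the model, then align every new batch against it before scaling and predicting:

from joblib import dump, load

# at training time, right after pd.get_dummies() on the training set
dump(list(df_olddata.columns), 'train_columns.joblib')

# at prediction time, before scaling and calling predict()
train_columns = load('train_columns.joblib')
df_newdata = df_newdata.reindex(columns=train_columns, fill_value=0)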
Hope this helps