I'm following one of the kernels on Kaggle, specifically a kernel for Credit Card Fraud Detection.
I reached the step where I need to perform KFold in order to find the best parameters for Logistic Regression.
The following code is shown in the kernel itself, but for some reason (probably because it was written for an older version of scikit-learn) it gives me some errors.
    def printing_Kfold_scores(x_train_data,y_train_data):
        fold = KFold(len(y_train_data),5,shuffle=False)
        # Different C parameters
        c_param_range = [0.01,0.1,1,10,100]
        results_table = pd.DataFrame(index = range(len(c_param_range),2), columns = ['C_parameter','Mean recall score'])
        results_table['C_parameter'] = c_param_range
        # the k-fold will give 2 lists: train_indices = indices[0], test_indices = indices[1]
        j = 0
        for c_param in c_param_range:
            print('-------------------------------------------')
            print('C parameter: ', c_param)
            print('-------------------------------------------')
            print('')
            recall_accs = []
            for iteration, indices in enumerate(fold,start=1):
                # Call the logistic regression model with a certain C parameter
                lr = LogisticRegression(C = c_param, penalty = 'l1')
                # Use the training data to fit the model. In this case, we use the portion of the fold to train the model
                # with indices[0]. We then predict on the portion assigned as the 'test cross validation' with indices[1]
                lr.fit(x_train_data.iloc[indices[0],:],y_train_data.iloc[indices[0],:].values.ravel())
                # Predict values using the test indices in the training data
                y_pred_undersample = lr.predict(x_train_data.iloc[indices[1],:].values)
                # Calculate the recall score and append it to a list for recall scores representing the current c_parameter
                recall_acc = recall_score(y_train_data.iloc[indices[1],:].values,y_pred_undersample)
                recall_accs.append(recall_acc)
                print('Iteration ', iteration,': recall score = ', recall_acc)
            # The mean value of those recall scores is the metric we want to save and get hold of.
            results_table.ix[j,'Mean recall score'] = np.mean(recall_accs)
            j += 1
            print('')
            print('Mean recall score ', np.mean(recall_accs))
            print('')
        best_c = results_table.loc[results_table['Mean recall score'].idxmax()]['C_parameter']
        # Finally, we can check which C parameter is the best amongst the chosen.
        print('*********************************************************************************')
        print('Best model to choose from cross validation is with C parameter = ', best_c)
        print('*********************************************************************************')
        return best_c
The errors I'm getting are as follows.
For this line:
    fold = KFold(len(y_train_data),5,shuffle=False)
I get this error:
    TypeError: __init__() got multiple values for argument 'shuffle'
If I remove shuffle=False from this line, I get the following error:
    TypeError: shuffle must be True or False; got 5
If I remove the 5 and keep shuffle=False, I get the following error:
    TypeError: 'KFold' object is not iterable
which comes from this line:
    for iteration, indices in enumerate(fold,start=1):
If someone can help me solve this issue and suggest how this can be done with the latest version of scikit-learn, it would be much appreciated.
Thanks.
That depends on how you have imported KFold.
If you did this:
    from sklearn.cross_validation import KFold
then your code should work, because that KFold takes three parameters: the length of the array, the number of splits, and shuffle.
But if you are doing this:
    from sklearn.model_selection import KFold
then it will not work. This KFold takes only the number of splits and shuffle, so there is no need to pass the length of the array. The object is also not iterable by itself: you call its split() method on the data, which means the enumerate() loop has to change as well.
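To illustrate what the model_selection version gives you, here is a minimal sketch on a small dummy array (the array X and its size are made up for the example, they are not from the kernel). It only shows that split() yields one pair of index arrays per fold:

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten dummy samples with two features each; only the number of rows matters here.
X = np.arange(20).reshape(10, 2)

# n_splits comes first; the length of the data is no longer passed to KFold.
fold = KFold(n_splits=5, shuffle=False)

# split() yields one (train_index, test_index) pair of integer arrays per fold.
splits = list(fold.split(X))
for iteration, (train_index, test_index) in enumerate(splits, start=1):
    print('Iteration', iteration, 'test indices:', test_index)
```

If you still want the iteration counter from the original loop, wrapping fold.split(...) in enumerate(..., start=1) as above keeps it.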
By the way, model_selection is the newer module and the recommended one to use (sklearn.cross_validation was deprecated in scikit-learn 0.18 and removed in 0.20). Try using it like this:
    fold = KFold(5,shuffle=False)
    for train_index, test_index in fold.split(x_train_data):
        # Call the logistic regression model with a certain C parameter
        # (recent scikit-learn versions default to a solver without 'l1' support, so pick liblinear)
        lr = LogisticRegression(C = c_param, penalty = 'l1', solver = 'liblinear')
        # Use the training data to fit the model. In this case, we use the portion of the fold to train the model
        lr.fit(x_train_data.iloc[train_index,:], y_train_data.iloc[train_index,:].values.ravel())
        # Predict values using the test indices in the training data
        y_pred_undersample = lr.predict(x_train_data.iloc[test_index,:].values)
        # Calculate the recall score and append it to a list for recall scores representing the current c_parameter
        recall_acc = recall_score(y_train_data.iloc[test_index,:].values,y_pred_undersample)
        recall_accs.append(recall_acc)
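Putting it together, the whole function could look roughly like the sketch below. This is not the kernel author's code, just one way to adapt it. The assumptions are mine: the data comes from make_classification instead of the credit card dataset, the column names are invented, results_table is built directly from the list of C values, .ix (removed from pandas) is replaced with .loc, and solver='liblinear' is set so that the 'l1' penalty is accepted.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import KFold


def printing_Kfold_scores(x_train_data, y_train_data):
    # New API: only the number of splits and shuffle are passed.
    fold = KFold(n_splits=5, shuffle=False)

    # Different C parameters
    c_param_range = [0.01, 0.1, 1, 10, 100]
    results_table = pd.DataFrame({'C_parameter': c_param_range,
                                  'Mean recall score': np.nan})

    for j, c_param in enumerate(c_param_range):
        print('C parameter: ', c_param)
        recall_accs = []
        # split() yields (train_index, test_index) for each fold.
        for iteration, (train_index, test_index) in enumerate(
                fold.split(x_train_data), start=1):
            # liblinear is one of the solvers that supports the 'l1' penalty.
            lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')
            lr.fit(x_train_data.iloc[train_index, :],
                   y_train_data.iloc[train_index, :].values.ravel())
            y_pred = lr.predict(x_train_data.iloc[test_index, :])
            recall_acc = recall_score(
                y_train_data.iloc[test_index, :].values.ravel(), y_pred)
            recall_accs.append(recall_acc)
            print('Iteration ', iteration, ': recall score = ', recall_acc)

        # Store the mean recall for this C value (.loc instead of .ix).
        results_table.loc[j, 'Mean recall score'] = np.mean(recall_accs)
        print('Mean recall score ', np.mean(recall_accs))

    best_c = results_table.loc[
        results_table['Mean recall score'].idxmax(), 'C_parameter']
    print('Best model to choose from cross validation is with C parameter = ', best_c)
    return best_c


# Synthetic stand-in data (assumption: not the real credit card dataset).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
x_df = pd.DataFrame(X, columns=['V1', 'V2', 'V3', 'V4', 'V5'])
y_df = pd.DataFrame(y, columns=['Class'])

best_c = printing_Kfold_scores(x_df, y_df)
```

With your own data, pass the undersampled feature and label DataFrames in place of x_df and y_df.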