TypeError: 'KFold' object is not iterable

kevinH · Feb 6, 2018

I'm following one of the kernels on Kaggle, specifically a kernel for Credit Card Fraud Detection.

I reached the step where I need to perform KFold in order to find the best parameters for Logistic Regression.

The following code is shown in the kernel itself, but for some reason (probably because it was written for an older version of scikit-learn) it gives me some errors.

def printing_Kfold_scores(x_train_data,y_train_data):
    fold = KFold(len(y_train_data),5,shuffle=False) 

    # Different C parameters
    c_param_range = [0.01,0.1,1,10,100]

    results_table = pd.DataFrame(index = range(len(c_param_range),2), columns = ['C_parameter','Mean recall score'])
    results_table['C_parameter'] = c_param_range

    # the k-fold will give 2 lists: train_indices = indices[0], test_indices = indices[1]
    j = 0
    for c_param in c_param_range:
        print('-------------------------------------------')
        print('C parameter: ', c_param)
        print('-------------------------------------------')
        print('')

        recall_accs = []
        for iteration, indices in enumerate(fold,start=1):

            # Call the logistic regression model with a certain C parameter
            lr = LogisticRegression(C = c_param, penalty = 'l1')

            # Use the training data to fit the model. In this case, we use the portion of the fold to train the model
            # with indices[0]. We then predict on the portion assigned as the 'test cross validation' with indices[1]
            lr.fit(x_train_data.iloc[indices[0],:],y_train_data.iloc[indices[0],:].values.ravel())

            # Predict values using the test indices in the training data
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1],:].values)

            # Calculate the recall score and append it to a list for recall scores representing the current c_parameter
            recall_acc = recall_score(y_train_data.iloc[indices[1],:].values,y_pred_undersample)
            recall_accs.append(recall_acc)
            print('Iteration ', iteration,': recall score = ', recall_acc)

            # The mean value of those recall scores is the metric we want to save and get hold of.
        results_table.ix[j,'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')

    best_c = results_table.loc[results_table['Mean recall score'].idxmax()]['C_parameter']

    # Finally, we can check which C parameter is the best amongst the chosen.
    print('*********************************************************************************')
    print('Best model to choose from cross validation is with C parameter = ', best_c)
    print('*********************************************************************************')

    return best_c

The errors I'm getting are as follows. For this line:

fold = KFold(len(y_train_data),5,shuffle=False)

the error is:

TypeError: __init__() got multiple values for argument 'shuffle'

If I remove the shuffle=False from this line, I get the following error:

TypeError: shuffle must be True or False; got 5

If I remove the 5 and keep the shuffle=False, I get the following error:

TypeError: 'KFold' object is not iterable

which comes from this line:

for iteration, indices in enumerate(fold,start=1):

If someone can help me solve this issue and suggest how this can be done with the latest version of scikit-learn, it would be much appreciated.

Thanks.

Answer

Vivek Kumar · Feb 6, 2018

That depends on how you imported KFold.

If you did this:

from sklearn.cross_validation import KFold

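then your code should work, because that KFold takes three parameters: the length of the array, the number of splits, and shuffle. Note that sklearn.cross_validation was deprecated in scikit-learn 0.18 and removed in 0.20, so this import only works on old versions.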

But if you are doing this:

from sklearn.model_selection import KFold

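then it will not work. This KFold takes only the number of splits and shuffle; you no longer pass the length of the array. You also need to change the enumerate() call, because the KFold object itself is not iterable: you iterate over what fold.split() returns instead.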
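To make the difference concrete, here is a small sketch on made-up data (the toy array X and the variable names is_iterable and splits are mine, not from the kernel). It shows that the splitter cannot be iterated directly, while fold.split(X) yields one (train_index, test_index) pair per fold, which you can still wrap in enumerate() if you want the iteration counter:

```python
import numpy as np
from sklearn.model_selection import KFold

# 6 toy samples with 2 features each (made-up data, for illustration only)
X = np.arange(12).reshape(6, 2)

fold = KFold(n_splits=3, shuffle=False)

# The splitter object is not iterable by itself...
try:
    iter(fold)
    is_iterable = True
except TypeError:
    is_iterable = False

# ...but fold.split(X) yields one (train_index, test_index) pair per fold.
splits = list(fold.split(X))
for iteration, (train_index, test_index) in enumerate(splits, start=1):
    print("Iteration", iteration, "train:", train_index, "test:", test_index)
```

Without shuffling, the test folds are consecutive blocks of row positions, so the first fold here tests on rows 0 and 1 and trains on rows 2 to 5.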

By the way, model_selection is the newer module and the recommended one. Try using it like this:

fold = KFold(5,shuffle=False)

for train_index, test_index in fold.split(x_train_data):

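    # Call the logistic regression model with a certain C parameter.
    # solver='liblinear' is set explicitly: the default solver in recent scikit-learn (lbfgs) does not support the l1 penalty
    lr = LogisticRegression(C = c_param, penalty = 'l1', solver = 'liblinear')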
    # Use the training data to fit the model. In this case, we use the portion of the fold to train the model
    lr.fit(x_train_data.iloc[train_index,:], y_train_data.iloc[train_index,:].values.ravel())

    # Predict values using the test indices in the training data
    y_pred_undersample = lr.predict(x_train_data.iloc[test_index,:].values)

    # Calculate the recall score and append it to a list for recall scores representing the current c_parameter
    recall_acc = recall_score(y_train_data.iloc[test_index,:].values,y_pred_undersample)
    recall_accs.append(recall_acc)
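Putting it together, a self-contained version of the whole search might look like the sketch below. This is not the kernel's code: the function name mean_recall_per_c, the synthetic data from make_classification, and the column names are assumptions of mine so that the example runs on its own. It keeps the iteration counter with enumerate(fold.split(...), start=1) and builds the results table from a list of rows instead of using .ix, which newer pandas versions no longer have.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import KFold


def mean_recall_per_c(x_train_data, y_train_data, c_param_range, n_splits=5):
    """Return a DataFrame with the mean cross-validated recall for each C."""
    fold = KFold(n_splits=n_splits, shuffle=False)
    rows = []
    for c_param in c_param_range:
        recall_accs = []
        # KFold is not iterable; iterate over fold.split(...) instead.
        for iteration, (train_index, test_index) in enumerate(
            fold.split(x_train_data), start=1
        ):
            # liblinear is chosen because it supports the l1 penalty
            lr = LogisticRegression(C=c_param, penalty="l1", solver="liblinear")
            lr.fit(
                x_train_data.iloc[train_index],
                y_train_data.iloc[train_index].values.ravel(),
            )
            y_pred = lr.predict(x_train_data.iloc[test_index])
            y_true = y_train_data.iloc[test_index].values.ravel()
            recall_acc = recall_score(y_true, y_pred)
            recall_accs.append(recall_acc)
            print("C =", c_param, "iteration", iteration, "recall =", recall_acc)
        rows.append(
            {"C_parameter": c_param, "Mean recall score": np.mean(recall_accs)}
        )
    return pd.DataFrame(rows)


# Synthetic stand-in data (an assumption; the kernel uses the credit card data).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
x_train_data = pd.DataFrame(X, columns=[f"V{i}" for i in range(X.shape[1])])
y_train_data = pd.DataFrame(y, columns=["Class"])

c_param_range = [0.01, 0.1, 1, 10, 100]
results_table = mean_recall_per_c(x_train_data, y_train_data, c_param_range)
best_c = results_table.loc[results_table["Mean recall score"].idxmax(), "C_parameter"]
print(results_table)
print("Best C:", best_c)
```

With your own data, pass your undersampled training DataFrames in place of the synthetic ones; the rest should stay the same as long as y_train_data is a single-column DataFrame, as it is in the kernel.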