scikit-learn GridSearchCV with multiple repetitions

Titus Pullo · Feb 14, 2017

I'm trying to get the best set of parameters for an SVR model. I'd like to use GridSearchCV over different values of C. However, from previous tests I noticed that the split into training/test sets highly influences the overall performance (r2 in this instance). To address this problem, I'd like to implement a repeated 5-fold cross-validation (10 x 5CV). Is there a built-in way of performing it using GridSearchCV?

QUICK SOLUTION:

Following the idea presented in the official scikit-learn documentation, a quick solution is:

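import numpy
from sklearn.model_selection import GridSearchCV, KFold

# svr, p_grid, X and y are assumed to be defined beforehand
NUM_TRIALS = 10
scores = []
for i in range(NUM_TRIALS):
    # A different seed per trial gives a different 5-fold split each time
    cv = KFold(n_splits=5, shuffle=True, random_state=i)
    clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=cv)
    clf.fit(X, y)  # best_score_ only exists after fitting
    scores.append(clf.best_score_)
print("Average Score: {0} STD: {1}".format(numpy.mean(scores), numpy.std(scores)))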

Answer

Vivek Kumar · Feb 14, 2017

This is called nested cross-validation. You can look at the official documentation example to guide you in the right direction, and also have a look at my other answer here for a similar approach.

You can adapt the steps to suit your need:

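from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

# SVC is a classifier (iris is a classification task); for regression use SVR instead
svr = SVC(kernel="rbf")
c_grid = {"C": [1, 10, 100]}  # ... extend with further values as needed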

# Other CV techniques can be used instead: "GroupKFold", "LeaveOneOut", "LeaveOneGroupOut", etc.

# Seed for the shuffled splits; the official example loops over several seeds (trials)
i = 0

# To be used within GridSearch (5 in your case)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=i)

# To be used in outer CV (you asked for 10)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=i)

# Non_nested parameter search and scoring
clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
clf.fit(X_iris, y_iris)
non_nested_score = clf.best_score_

# Pass the gridSearch estimator to cross_val_score
# This will be your required 10 x 5 cvs
# 10 for outer cv and 5 for gridSearch's internal CV
clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv).mean()
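
If the aim is specifically the repeated 5-fold scheme from the question (10 repetitions of 5-fold CV inside the search itself), one possible sketch passes a RepeatedKFold splitter as cv. This assumes scikit-learn 0.19 or later, where RepeatedKFold is available; the diabetes dataset and the C values are illustrative stand-ins, not taken from the question.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.svm import SVR

# Illustrative regression data (assumption: stands in for the asker's data)
X, y = load_diabetes(return_X_y=True)

# 5-fold CV repeated 10 times, each repetition with a different shuffle
repeated_cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)

# Illustrative grid (assumption: the question does not list its C values)
p_grid = {"C": [1, 10, 100]}

search = GridSearchCV(estimator=SVR(kernel="rbf"), param_grid=p_grid,
                      scoring="r2", cv=repeated_cv)
search.fit(X, y)

# mean_test_score / std_test_score are computed over all 50 splits
best = search.best_index_
print("Best params:", search.best_params_)
print("Mean r2: {0:.3f} STD: {1:.3f}".format(
    search.cv_results_["mean_test_score"][best],
    search.cv_results_["std_test_score"][best]))
```

This averages each candidate over 50 splits when choosing C. Unlike nested CV, best_score_ here is still an optimistic estimate, because the same data both selects and scores the parameters.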

Edit - Description of nested cross validation with cross_val_score() and GridSearchCV()

  1. clf = GridSearchCV(estimator, param_grid, cv= inner_cv).
  2. Pass clf, X, y, outer_cv to cross_val_score
  3. As seen in source code of cross_val_score, this X will be divided into X_outer_train, X_outer_test using outer_cv. Same for y.
  4. X_outer_test will be held back and X_outer_train will be passed on to clf for fit() (GridSearchCV in our case). Assume X_outer_train is called X_inner from here on since it is passed to inner estimator, assume y_outer_train is y_inner.
  5. X_inner will now be split into X_inner_train and X_inner_test using inner_cv in the GridSearchCV. Same for y
  6. Now the gridSearch estimator will be trained using X_inner_train and y_inner_train and scored using X_inner_test and y_inner_test.
  7. Steps 5 and 6 will be repeated for inner_cv_iters (5 in this case), for every hyper-parameter combination in the grid.
  8. The hyper-parameters for which the average score over all inner iterations (X_inner_train, X_inner_test) is best are passed on to clf.best_estimator_, which is then fitted on all the data passed to clf, i.e. X_outer_train.
  9. This clf (gridsearch.best_estimator_) will then be scored using X_outer_test and y_outer_test.
  10. Steps 3 to 9 will be repeated for outer_cv_iters (10 here), and an array of scores will be returned from cross_val_score.
  11. We then use mean() to get back nested_score.
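
The steps above can be sketched as an explicit loop. This is a minimal illustration, assuming the iris data and an illustrative C grid; the names manual_scores, search and reference are hypothetical, and the real cross_val_score also handles scoring options, parallelism and fit parameters that this sketch leaves out.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
c_grid = {"C": [1, 10, 100]}  # illustrative values (assumption)

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=0)
clf = GridSearchCV(estimator=SVC(kernel="rbf"), param_grid=c_grid, cv=inner_cv)

manual_scores = []
for train_idx, test_idx in outer_cv.split(X, y):      # step 3: outer split
    X_inner, y_inner = X[train_idx], y[train_idx]     # step 4: hold back the outer test fold
    search = clone(clf)                               # fresh, unfitted copy per outer fold
    search.fit(X_inner, y_inner)                      # steps 5-8: inner CV, then refit the best C
    manual_scores.append(search.score(X[test_idx], y[test_idx]))  # step 9

manual_scores = np.array(manual_scores)               # step 10: one score per outer fold
nested_score = manual_scores.mean()                   # step 11

# The same computation done by cross_val_score, for comparison
reference = cross_val_score(clf, X=X, y=y, cv=outer_cv)
print(nested_score, reference.mean())
```

Because both splitters use a fixed random_state, the manual loop and cross_val_score see identical folds, so the two printed means should agree.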