What is the meaning of 'mean_test_score' in cv_results_?

Dipe · Jul 6, 2017

Hello, I'm running a GridSearchCV and printing the result with the .cv_results_ attribute from scikit-learn.

My problem is that when I compute the mean over all the test score splits by hand, I get a different number from the one reported in 'mean_test_score'. Is it computed differently from the standard np.mean()?

I attach here the code with the result:

from sklearn.model_selection import GridSearchCV, GroupKFold

# model, score_auc, X, Y and patients are defined elsewhere (not shown here)
n_estimators = [100]
max_depth = [3]
learning_rate = [0.1]

param_grid = dict(max_depth=max_depth, n_estimators=n_estimators, learning_rate=learning_rate)

gkf = GroupKFold(n_splits=7)

grid_search = GridSearchCV(model, param_grid, scoring=score_auc, cv=gkf)
grid_result = grid_search.fit(X, Y, groups=patients)


The result of this operation is:

{'mean_fit_time': array([ 8.92773601]),
 'mean_score_time': array([ 0.04288721]),
 'mean_test_score': array([ 0.83490629]),
 'mean_train_score': array([ 0.95167036]),
 'param_learning_rate': masked_array(data = [0.1],
              mask = [False],
        fill_value = ?),
 'param_max_depth': masked_array(data = [3],
              mask = [False],
        fill_value = ?),
 'param_n_estimators': masked_array(data = [100],
              mask = [False],
        fill_value = ?),
 'params': ({'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100},),
 'rank_test_score': array([1]),
 'split0_test_score': array([ 0.74821666]),
 'split0_train_score': array([ 0.97564995]),
 'split1_test_score': array([ 0.80089016]),
 'split1_train_score': array([ 0.95361201]),
 'split2_test_score': array([ 0.92876979]),
 'split2_train_score': array([ 0.93935856]),
 'split3_test_score': array([ 0.95540287]),
 'split3_train_score': array([ 0.94718634]),
 'split4_test_score': array([ 0.89083901]),
 'split4_train_score': array([ 0.94787374]),
 'split5_test_score': array([ 0.90926355]),
 'split5_train_score': array([ 0.94829775]),
 'split6_test_score': array([ 0.82520379]),
 'split6_train_score': array([ 0.94971417]),
 'std_fit_time': array([ 1.79167576]),
 'std_score_time': array([ 0.02970254]),
 'std_test_score': array([ 0.0809713]),
 'std_train_score': array([ 0.0105566])}

As you can see, taking the np.mean of all the split test scores gives approximately 0.8655122606479532, while 'mean_test_score' is 0.83490629.

Thanks for your help, Leonardo.


Johannes · Jul 6, 2017

I will post this as a new answer since it's so much code:

The test and train scores of the folds are (taken from the results you posted in your question):

test_scores = [0.74821666,0.80089016,0.92876979,0.95540287,0.89083901,0.90926355,0.82520379]
train_scores = [0.97564995,0.95361201,0.93935856,0.94718634,0.94787374,0.94829775,0.94971417]

The numbers of training and test samples in those folds are (taken from the output of print([(len(train), len(test)) for train, test in gkf.split(X, groups=patients)])):

train_len = [41835, 56229, 56581, 58759, 60893, 60919, 62056]
test_len = [24377, 9983, 9631, 7453, 5319, 5293, 4156]
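To see how uneven these folds are, here is a small plain-Python sketch (no scikit-learn needed) that turns the test fold sizes above into the share each fold would receive in a size-weighted mean. The names total and shares are only illustrative and do not come from the question.

```python
# Test fold sizes copied from the list above.
test_len = [24377, 9983, 9631, 7453, 5319, 5293, 4156]

# Total number of test samples over all seven folds.
total = sum(test_len)

# Share of the overall weight each fold gets when weighting by fold size.
shares = [n / total for n in test_len]

for i, (n, share) in enumerate(zip(test_len, shares)):
    print(f"fold {i}: {n:5d} test samples, weight share {share:.3f}")
```

Fold 0 alone holds roughly 37% of the test samples, whereas a plain mean would give each of the seven folds the same 1/7 (about 14%).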

Then the train and test means, weighted by the number of training and test samples per fold respectively, are:

import numpy as np

train_avg = np.average(train_scores, weights=train_len)
# -> 0.95064898361714389
test_avg = np.average(test_scores, weights=test_len)
# -> 0.83490628649308296

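So the weighted test mean is exactly the 'mean_test_score' that sklearn gives you. (The weighted train mean, 0.95064898, does not match 'mean_train_score' = 0.95167036, which is simply the unweighted mean of the train scores, so only the test scores are weighted.) The weighting comes from GridSearchCV's iid parameter, which defaulted to True in the scikit-learn versions current at the time and weights each fold by its number of test samples; later releases deprecated and then removed this option, so they report the plain mean. The weighted value lets every test sample count equally, whereas the plain mean of the folds depends on the somewhat arbitrary splits/folds you chose, and GroupKFold produced folds of very different sizes here.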
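The comparison can be reproduced from the posted numbers alone. The following is a minimal plain-Python sketch, not scikit-learn's own code: weighted_mean is a hypothetical helper that computes the same quantity as np.average(values, weights=weights), and all the data is copied from the question and from the lists above.

```python
# Per-fold scores copied from the cv_results_ output in the question.
test_scores = [0.74821666, 0.80089016, 0.92876979, 0.95540287,
               0.89083901, 0.90926355, 0.82520379]
train_scores = [0.97564995, 0.95361201, 0.93935856, 0.94718634,
                0.94787374, 0.94829775, 0.94971417]
# Test fold sizes copied from the answer.
test_len = [24377, 9983, 9631, 7453, 5319, 5293, 4156]


def weighted_mean(values, weights):
    """Hypothetical helper: same quantity as np.average(values, weights=weights)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Plain (unweighted) mean of the test scores, as computed by hand in the question.
plain_test = sum(test_scores) / len(test_scores)
# Mean of the test scores weighted by the number of test samples per fold.
weighted_test = weighted_mean(test_scores, test_len)
# Plain (unweighted) mean of the train scores.
plain_train = sum(train_scores) / len(train_scores)

print(f"plain test mean:    {plain_test:.8f}")     # about 0.8655
print(f"weighted test mean: {weighted_test:.8f}")  # about 0.8349, the posted mean_test_score
print(f"plain train mean:   {plain_train:.8f}")    # about 0.9517, the posted mean_train_score
```

With these numbers the weighted test mean reproduces the posted 'mean_test_score', while the posted 'mean_train_score' matches the plain mean of the train scores.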

So in conclusion, both explanations were indeed identical and correct.