How to estimate the progress of a GridSearchCV from verbose output in Scikit-Learn?

O.rka · Apr 13, 2017

Right now I'm running a pretty aggressive grid search. I have n=135 samples and I am running 23 folds using a custom cross-validation train/test list. I have set verbose=2.

The following is what I ran:

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_test = {"loss": ["deviance"],
              "learning_rate": [0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2],
              "min_samples_split": np.linspace(0.1, 0.5, 12),
              "min_samples_leaf": np.linspace(0.1, 0.5, 12),
              "max_depth": [3, 5, 8],
              "max_features": ["log2", "sqrt"],
              "min_impurity_split": [5e-6, 1e-7, 5e-7],
              "criterion": ["friedman_mse", "mae"],
              "subsample": [0.5, 0.618, 0.8, 0.85, 0.9, 0.95, 1.0],
              "n_estimators": [10]}

# cv_indices is my custom list of train/test splits (23 folds), defined elsewhere
Mod_gsearch = GridSearchCV(estimator=GradientBoostingClassifier(),
                           param_grid=param_test, scoring="accuracy",
                           n_jobs=32, iid=False, cv=cv_indices, verbose=2)

I took a look at the verbose output in stdout:

$ head gridsearch.o8475533
Fitting 23 folds for each of 254016 candidates, totalling 5842368 fits

Based on this, it looks like there are 5842368 fits in total: one for each pairing of a grid candidate with a cross-validation fold.

$ grep -c "[CV]" gridsearch.o8475533
7047332

It looks like there are around 7 million cross-validations that have been done so far but that's more than the 5842368 total fits...

7047332/5842368 = 1.2062458236

Then when I look at the stderr file:

$ cat ./gridsearch.e8475533
[Parallel(n_jobs=32)]: Done 132 tasks      | elapsed:    1.2s
[Parallel(n_jobs=32)]: Done 538 tasks      | elapsed:    2.8s
[Parallel(n_jobs=32)]: Done 1104 tasks      | elapsed:    4.8s
[Parallel(n_jobs=32)]: Done 1834 tasks      | elapsed:    7.9s
[Parallel(n_jobs=32)]: Done 2724 tasks      | elapsed:   11.6s
...
[Parallel(n_jobs=32)]: Done 3396203 tasks      | elapsed: 250.2min
[Parallel(n_jobs=32)]: Done 3420769 tasks      | elapsed: 276.5min
[Parallel(n_jobs=32)]: Done 3447309 tasks      | elapsed: 279.3min
[Parallel(n_jobs=32)]: Done 3484240 tasks      | elapsed: 282.3min
[Parallel(n_jobs=32)]: Done 3523550 tasks      | elapsed: 285.3min

My goal:

How can I know the progress of my gridsearch with respect to the total time it may take?

What I'm confused about:

What is the relationship between [CV] lines in stdout, total # of fits in stdout, and tasks in stderr?

Answer

vladkha · May 21, 2017

The math is simple, but a little misleading at first sight:

  1. When each task starts, the logging mechanism writes a '[CV] ...' line to stdout announcing the start of execution. When the task ends, it writes another '[CV] ...' line that additionally reports the time spent on that particular task (at the end of the line). So every finished task accounts for two '[CV]' lines.

  2. Additionally, at certain time intervals, the logging mechanism writes a progress line to stderr (or to stdout if you set verbose to >50) reporting how many of the total tasks (fits) have completed so far and the total time elapsed, like this one:

    [Parallel(n_jobs=32)]: Done 2724 tasks | elapsed: 11.6s
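To make use of such lines programmatically, here is a minimal sketch that pulls the completed-task count and elapsed time out of one progress line. The function name `parse_progress` and the regular expression are my own illustration, not part of scikit-learn or joblib, and the pattern only covers the two time units (`s` and `min`) that appear in the stderr excerpt from the question.

```python
import re

# Pattern for progress lines shaped like the ones in the stderr excerpt.
# Assumption: elapsed time is reported either in seconds ("s") or minutes ("min").
PROGRESS_RE = re.compile(
    r"Done\s+(\d+)\s+tasks\s*\|\s*elapsed:\s*([\d.]+)\s*(s|min)"
)


def parse_progress(line):
    """Return (completed_tasks, elapsed_seconds), or None if the line does not match."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    tasks = int(match.group(1))
    value = float(match.group(2))
    unit = match.group(3)
    # Normalise everything to seconds so lines can be compared directly.
    seconds = value * 60.0 if unit == "min" else value
    return tasks, seconds


line = "[Parallel(n_jobs=32)]: Done 2724 tasks      | elapsed:   11.6s"
print(parse_progress(line))
```

Applied to the last line of the stderr file, this would give the most recent count of finished tasks.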

For your case, you have 5842368 total fits, i.e. tasks.
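As a sanity check on that total, the number of candidates is just the product of the number of values listed for each hyperparameter, multiplied by the number of folds. The sketch below hard-codes the value counts read off the grid in the question (the dictionary name `values_per_param` is only for illustration); with scikit-learn at hand, `len(ParameterGrid(param_test))` should give the same candidate count.

```python
import math

# Number of values listed for each hyperparameter in the question's grid.
values_per_param = {
    "loss": 1,
    "learning_rate": 7,
    "min_samples_split": 12,
    "min_samples_leaf": 12,
    "max_depth": 3,
    "max_features": 2,
    "min_impurity_split": 3,
    "criterion": 2,
    "subsample": 7,
    "n_estimators": 1,
}
n_folds = 23  # length of the custom cross-validation list

# Every combination of values is one candidate; each candidate is fit once per fold.
n_candidates = math.prod(values_per_param.values())
total_fits = n_candidates * n_folds

print(n_candidates, total_fits)  # 254016 5842368
```

This matches the "Fitting 23 folds for each of 254016 candidates, totalling 5842368 fits" line in stdout.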

You counted 7047332 '[CV] ...' lines, which corresponds to about 7047332/2 = 3523666 finished tasks, since each finished task writes two such lines. The progress line shows exactly how many tasks are completed: 3523550. The two numbers differ slightly because some tasks had started but not yet finished at the time of counting.
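Putting the numbers together gives a rough progress estimate. This is only a sketch under a strong assumption: that the remaining fits run at the same average rate as the completed ones, which may not hold because different parameter settings can take very different times to fit. The helper name `estimate_progress` is hypothetical.

```python
def estimate_progress(done_tasks, total_tasks, elapsed_minutes):
    """Return (fraction_done, estimated_minutes_remaining).

    Assumes a constant average time per task, so treat the result as a rough guide.
    """
    fraction = done_tasks / total_tasks
    estimated_total = elapsed_minutes / fraction
    return fraction, estimated_total - elapsed_minutes


# Figures taken from the question's stdout and stderr.
total_fits = 5842368
cv_lines = 7047332
done_from_stderr = 3523550
elapsed_min = 285.3

# Each finished task writes two '[CV]' lines, so halve the line count.
done_from_stdout = cv_lines // 2

fraction, remaining = estimate_progress(done_from_stderr, total_fits, elapsed_min)
print(f"finished tasks (stdout estimate): {done_from_stdout}")
print(f"finished tasks (stderr): {done_from_stderr}")
print(f"{fraction:.1%} done, roughly {remaining:.0f} min remaining")
```

With these figures the search is a little over 60% complete.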