Is there a quicker way of running GridsearchCV

bidby picture bidby · Feb 26, 2016 · Viewed 15k times · Source

I'm optimizing some paramters for an SVC in sklearn, and the biggest issue here is having to wait 30 minutes before I try out any other parameter ranges. Worse is the fact that I'd like to try more values for c and gamma within the same range (so I can create a smoother surface plot) but I know that it will just take longer and longer... When I ran it today I changed the cache_size from 200 to 600 (without really knowing what it does) to see if it made a difference. The time decreased by about a minute.

Is this something I can help? Or am I just gonna have to deal with a very long time?

clf = svm.SVC(kernel="rbf" , probability = True, cache_size = 600)

gamma_range = [1e-7,1e-6,1e-5,1e-4,1e-3,1e-2,1e-1,1e0,1e1]
c_range = [1e-3,1e-2,1e-1,1e0,1e1,1e2,1e3,1e4,1e5]
param_grid = dict(gamma = gamma_range, C = c_range)

grid = GridSearchCV(clf, param_grid, cv= 10, scoring="accuracy")
%time grid.fit(X_norm, y)

returns:

Wall time: 32min 59s

GridSearchCV(cv=10, error_score='raise',
   estimator=SVC(C=1.0, cache_size=600, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
kernel='rbf', max_iter=-1, probability=True, random_state=None,
shrinking=True, tol=0.001, verbose=False),
   fit_params={}, iid=True, loss_func=None, n_jobs=1,
   param_grid={'C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0, 100000.0], 'gamma': [1e-07, 1e-06, 1e-05, 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0]},
   pre_dispatch='2*n_jobs', refit=True, score_func=None,
   scoring='accuracy', verbose=0)

Answer

Randy picture Randy · Feb 26, 2016

A few things:

  1. 10-fold CV is overkill and causes you to fit 10 models for each parameter group. You can get an instant 2-3x speedup by switching to 5- or 3-fold CV (i.e., cv=3 in the GridSearchCV call) without any meaningful difference in performance estimation.
  2. Try fewer parameter options at each round. With 9x9 combinations, you're trying 81 different combinations on each run. Typically, you'll find better performance at one end of the scale or the other, so maybe start with a coarse grid of 3-4 options, and then go finer as you start to identify the area that's more interesting for your data. 3x3 options means a 9x speedup vs. what you're doing now.
  3. You can get a trivial speedup by setting njobs to 2+ in your GridSearchCV call so you run multiple models at once. Depending on the size of your data, you may not be able to increase it too high, and you won't see an improvement increasing it past the number of cores you're running, but you can probably trim a bit of time that way.