I'm optimizing some parameters for an SVC in sklearn, and the biggest issue is having to wait 30 minutes before I can try any other parameter ranges. Worse, I'd like to try more values for C and gamma within the same range (so I can create a smoother surface plot), but I know that will just take longer and longer. When I ran it today I changed cache_size from 200 to 600 (without really knowing what it does) to see if it made a difference. The time decreased by about a minute.
Is there anything I can do to speed this up, or am I just going to have to live with the long run time?
from sklearn import svm
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in releases before 0.18

clf = svm.SVC(kernel="rbf", probability=True, cache_size=600)
gamma_range = [1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1]
c_range = [1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3, 1e4, 1e5]
param_grid = dict(gamma=gamma_range, C=c_range)
grid = GridSearchCV(clf, param_grid, cv=10, scoring="accuracy")
%time grid.fit(X_norm, y)
returns:
Wall time: 32min 59s
GridSearchCV(cv=10, error_score='raise',
estimator=SVC(C=1.0, cache_size=600, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
kernel='rbf', max_iter=-1, probability=True, random_state=None,
shrinking=True, tol=0.001, verbose=False),
fit_params={}, iid=True, loss_func=None, n_jobs=1,
param_grid={'C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0, 100000.0], 'gamma': [1e-07, 1e-06, 1e-05, 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0]},
pre_dispatch='2*n_jobs', refit=True, score_func=None,
scoring='accuracy', verbose=0)
A few things:
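- Use fewer cross-validation folds (for example, cv=3 in the GridSearchCV call instead of cv=10) without any meaningful difference in performance estimation. Every parameter combination is fit once per fold, so with your 9 x 9 grid this cuts the search from 810 fits to 243.
- Set n_jobs to 2 or more in your GridSearchCV call so you run multiple models at once. Depending on the size of your data, you may not be able to increase it too high, and you won't see an improvement from increasing it past the number of cores you have, but you can probably trim a bit of time that way.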
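Put together, the two suggestions might look something like the sketch below. It assumes a recent scikit-learn (where GridSearchCV lives in sklearn.model_selection) and uses a small synthetic dataset as a stand-in for your X_norm and y, since the real data isn't shown. The value n_jobs=2 is only an example; pick whatever fits your core count and memory.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the question's X_norm and y (assumption: real data not shown).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_norm = StandardScaler().fit_transform(X)

# Same parameter ranges as in the question.
gamma_range = [1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1]
c_range = [1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3, 1e4, 1e5]
param_grid = {"gamma": gamma_range, "C": c_range}

clf = SVC(kernel="rbf", probability=True, cache_size=600)

# cv=3 means 81 * 3 = 243 fits during the search instead of 81 * 10 = 810.
# n_jobs=2 runs two fits at a time (example value, not a recommendation).
grid = GridSearchCV(clf, param_grid, cv=3, scoring="accuracy", n_jobs=2)
grid.fit(X_norm, y)

print(grid.best_params_, grid.best_score_)
```

The timings you get will depend on your data and machine, so it's worth re-running %time on the real X_norm to see how much each change actually buys you.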