Improving model training speed in caret (R)

Alexander David · Oct 2, 2015 · Viewed 10k times

I have a dataset consisting of 20 features and roughly 300,000 observations. I'm using caret to train models with doParallel and four cores. Even training on 10% of my data takes well over eight hours for the methods I've tried (rf, nnet, adabag, svmPoly). I'm resampling with bootstrapping 3 times and my tuneLength is 5. Is there anything I can do to speed up this agonizingly slow process? Someone suggested that using the underlying libraries directly could speed up the process by as much as 10x, but before I go down that route I'd like to make sure there is no other alternative.

Answer

topepo · Oct 5, 2015

@phiver hits the nail on the head, but for this situation there are a few things to suggest:

  • make sure that parallel processing is not exhausting your system memory: with X workers you are making X extra copies of the data in memory.
  • with a class imbalance, additional sampling can help. Downsampling might help improve performance and take less time.
  • use different libraries: ranger instead of randomForest, xgboost or C5.0 instead of gbm. Keep in mind that ensemble methods fit a ton of constituent models and are bound to take a while.
  • the package has a racing-type algorithm for tuning parameters in less time
  • the development version on GitHub has random search methods for the models with a lot of tuning parameters.
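The first three suggestions can be combined in one `train` call. A hypothetical sketch (the data frame `dat` and its factor outcome column `y` are assumed, not from the question): fewer parallel workers to cap memory use, down-sampling inside resampling via `trainControl(sampling = "down")`, and the faster `ranger` random-forest backend:

```r
library(caret)
library(doParallel)

# Fewer workers means fewer in-memory copies of the training data.
cl <- makePSOCKcluster(2)
registerDoParallel(cl)

# Down-sample the majority class within each bootstrap resample.
ctrl <- trainControl(method = "boot", number = 3,
                     sampling = "down")

# `ranger` is a much faster random forest implementation than
# `randomForest`, with the same caret interface.
fit <- train(y ~ ., data = dat,
             method = "ranger",
             trControl = ctrl,
             tuneLength = 5)

stopCluster(cl)
```

Down-sampling inside `trainControl` (rather than once, up front) keeps the held-out data representative of the original class distribution, so the resampled performance estimates stay honest.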
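Random search, mentioned in the last bullet, can be sketched as follows (hedged: at the time of the answer this lived in caret's development version on GitHub and was released later). With `search = "random"`, `tuneLength` becomes the number of randomly sampled parameter combinations rather than the number of levels per parameter, which avoids the combinatorial blow-up of a full grid:

```r
library(caret)

# Sample tuning parameter combinations at random instead of
# evaluating a full grid.
ctrl <- trainControl(method = "boot", number = 3,
                     search = "random")

# `dat`/`y` are assumed placeholders as before. With random search,
# tuneLength = 5 means 5 total candidate combinations, not 5 values
# per parameter (5^p combinations) as with grid search.
fit <- train(y ~ ., data = dat,
             method = "svmPoly",
             trControl = ctrl,
             tuneLength = 5)
```

For a model like svmPoly with three tuning parameters, that is 5 fits per resample instead of 125, a substantial saving on its own.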

Max