setting values for ntree and mtry for random forest regression model

DOSMarter picture DOSMarter · Dec 19, 2012 · Viewed 76.6k times · Source

I'm using R package randomForest to do a regression on some biological data. My training data size is 38772 X 201.

I just wondered---what would be a good value for the number of trees ntree and the number of variable per level mtry? Is there an approximate formula to find such parameter values?

Each row in my input data is a 200 character representing the amino acid sequence, and I want to build a regression model to use such sequence in order to predict the distances between the proteins.

Answer

Jeffrey Evans picture Jeffrey Evans · Dec 19, 2012

The default for mtry is quite sensible so there is not really a need to muck with it. There is a function tuneRF for optimizing this parameter. However, be aware that it may cause bias.

There is no optimization for the number of bootstrap replicates. I often start with ntree=501 and then plot the random forest object. This will show you the error convergence based on the OOB error. You want enough trees to stabilize the error but not so many that you over correlate the ensemble, which leads to overfit.

Here is the caveat: variable interactions stabilize at a slower rate than error so, if you have a large number of independent variables you need more replicates. I would keep the ntree an odd number so ties can be broken.

For the dimensions of you problem I would start ntree=1501. I would also recommended looking onto one of the published variable selection approaches to reduce the number of your independent variables.