Minimum number of observations when performing Random Forest

Oritteropus · Jul 9, 2013 · Viewed 13.2k times

Is it possible to apply Random Forests to very small datasets? I have a dataset with many variables but only 25 observations. Random forests produce reasonable results with low OOB errors (10-25%). Is there any rule of thumb regarding the minimum number of observations to use? In fact, one of the response variables is unbalanced, and if I subsample it I will end up with an even smaller number of observations. Thanks in advance.

Answer

Wake2Sleep · Aug 30, 2013

Absolutely, RF can be used on this type of dataset (i.e. p > n). In fact, RF is used in fields like genomics, where the number of fields is >= 20000 and there are only a very small number of rows, say 10-12. There, the entire problem is figuring out which of the 20k variables would make up a parsimonious marker (i.e. feature selection is the entire problem).

I don't have any rules of thumb about minimum size, other than this: if your model doesn't work well on a held-back sample (leave-one-out cross-validation might work well in your case), then you should try something else.
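To make the leave-one-out idea concrete, here is a rough sketch in Python with scikit-learn (my choice of tooling, the question doesn't say which software is used). The data is synthetic: 25 rows and 200 columns generated with `make_classification` as a stand-in for a real p > n dataset, and the tree count and seeds are arbitrary assumptions. It fits the forest 25 times, each time holding back one row, and prints the result next to the OOB estimate for comparison.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic stand-in for a small p > n dataset (assumption: 25 rows, 200 columns).
X, y = make_classification(
    n_samples=25,
    n_features=200,
    n_informative=5,
    n_redundant=0,
    random_state=0,
)

model = RandomForestClassifier(n_estimators=200, random_state=0)

# Leave-one-out: train on 24 rows, predict the single held-back row, repeat 25 times.
# Each entry of `scores` is 1.0 (held-back row predicted correctly) or 0.0.
scores = cross_val_score(model, X, y, cv=LeaveOneOut(), scoring="accuracy")
loo_accuracy = scores.mean()

# Out-of-bag estimate from a single fit on all 25 rows, for comparison.
oob_model = RandomForestClassifier(
    n_estimators=200, oob_score=True, random_state=0
).fit(X, y)
oob_accuracy = oob_model.oob_score_

print(f"Leave-one-out accuracy: {loo_accuracy:.2f}")
print(f"OOB accuracy:           {oob_accuracy:.2f}")
```

If the two numbers disagree badly, or the leave-one-out accuracy is no better than always guessing the majority class, I'd take that as the sign to try something else. With an unbalanced response, plain accuracy can look better than it is, so it is worth checking against that majority-class baseline.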

Hope this helps