When I train just using glm, everything works, and I don't even come close to exhausting memory. But when I run train(..., method='glm'), I run out of memory.

Is this because train is storing a lot of data for each iteration of the cross-validation (or whatever the trControl procedure is)? I'm looking at trainControl and I can't find how to prevent this... any hints? I only care about the performance summary and maybe the predicted responses.

(I know it's not related to storing data from each iteration of the parameter-tuning grid search, because I believe there's no grid for glms.)
The problem is twofold:

i) train() doesn't just fit a model via glm(); it resamples that model, so even with the defaults train() will do 25 bootstrap samples, which, coupled with problem ii), is the (or a) source of your problem, and

ii) train() simply calls glm() with its defaults, and those defaults are to store the model frame (argument model = TRUE of ?glm), which includes a copy of the data in model-frame form. The object returned by train() already stores a copy of the data in $trainingData, and the "glm" object in $finalModel also holds a copy of the actual data.
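A quick way to see where these copies live is to poke at a fitted object. The sketch below uses the clotting data from ?glm (the same data used further down); the component names $trainingData, $finalModel$data and $finalModel$model are how recent versions of caret and glm store things, so adjust if your version differs.

clotting <- data.frame(u = c(5,10,15,20,30,40,60,80,100),
                       lot1 = c(118,58,42,35,27,25,21,19,18),
                       lot2 = c(69,35,26,21,18,16,13,12,12))
require(caret)
fit <- train(lot1 ~ log(u), data = clotting, family = Gamma, method = "glm")
head(fit$trainingData)      ## copy of the data kept by train()
head(fit$finalModel$data)   ## copy of the data kept by glm()
head(fit$finalModel$model)  ## the model frame kept because model = TRUE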
At this point, simply running glm() via train() will produce 25 copies of the fully expanded model.frame plus the original data, all of which need to be held in memory during the resampling process; whether they are held concurrently or consecutively is not immediately clear from a quick look at the code, as the resampling happens in an lapply() call. There will also be 25 copies of the raw data.

Once the resampling is finished, the returned object will contain two copies of the raw data and a full copy of the model.frame. If your training data is large relative to the available RAM, or contains many factors to be expanded in the model.frame, then you could easily be using huge amounts of memory just carrying copies of the data around.
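To get a rough feel for the worst case on your own data, you can price up one copy of the expanded model frame and the raw data and multiply by the number of resamples. This is only a back-of-the-envelope sketch: it assumes the 25 copies are held concurrently, which, as noted above, isn't certain, and it reuses the clotting data frame defined above, so substitute your own formula and data.

mf <- model.frame(lot1 ~ log(u), data = clotting)  ## one expanded model frame
## worst-case bytes if 25 copies of the model frame and raw data coexist
25 * (as.numeric(object.size(mf)) + as.numeric(object.size(clotting)))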
If you add model = FALSE to your train() call, that might make a difference. Here is a small example using the clotting data from ?glm:
clotting <- data.frame(u = c(5,10,15,20,30,40,60,80,100),
                       lot1 = c(118,58,42,35,27,25,21,19,18),
                       lot2 = c(69,35,26,21,18,16,13,12,12))
require(caret)
then
> m1 <- train(lot1 ~ log(u), data=clotting, family = Gamma, method = "glm",
+ model = TRUE)
Fitting: parameter=none
Aggregating results
Fitting model on full training set
> m2 <- train(lot1 ~ log(u), data=clotting, family = Gamma, method = "glm",
+ model = FALSE)
Fitting: parameter=none
Aggregating results
Fitting model on full training set
> object.size(m1)
121832 bytes
> object.size(m2)
116456 bytes
> ## ordinary glm() call:
> m3 <- glm(lot1 ~ log(u), data=clotting, family = Gamma)
> object.size(m3)
47272 bytes
> m4 <- glm(lot1 ~ log(u), data=clotting, family = Gamma, model = FALSE)
> object.size(m4)
42152 bytes
So there is a size difference in the returned object, and memory use during training will be lower. How much lower will depend on whether the internals of train() keep all copies of the model.frame in memory during the resampling process.

The object returned by train() is also significantly larger than that returned by glm(), as mentioned by @DWin in the comments below.
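If all you need are the resampling summary and the hold-out predictions, you can also tell train() not to keep the training data at all via trainControl(). This is a hedged sketch: the returnData and savePredictions arguments exist in reasonably recent versions of caret, so check ?trainControl for the version you have installed.

## Keep the bootstrap summary and predictions, but drop the stored copies
## of the data (assumes a recent caret; see ?trainControl)
tc <- trainControl(method = "boot", number = 25,
                   returnData = FALSE,       ## don't store $trainingData
                   savePredictions = TRUE)   ## keep hold-out predictions
m5 <- train(lot1 ~ log(u), data = clotting, family = Gamma,
            method = "glm", model = FALSE, trControl = tc)
m5$results    ## performance summary
head(m5$pred) ## resampled predictions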
To take this further, either study the code more closely, or email Max Kuhn, the maintainer of caret, to enquire about options to reduce the memory footprint.