Train/test split in R's `caret` package

5xum · Mar 1, 2016 · Viewed 10.5k times

I'm getting familiar with R's caret package, but, coming from other programming languages, it has thoroughly confused me.

What I want to do now is a fairly simple machine learning workflow, which is:

  1. Take a dataset, in my case the iris dataset
  2. Split it into a training and a test set (an 80-20 split)
  3. For every k from 1 to 20, train the k nearest neighbor classifier on the training set
  4. Test it on the test set

I understand how to do the first part, since iris is already loaded. Then, the second part is done by calling

library(caret)

a <- createDataPartition(iris$Species, p = 0.8, list = FALSE)  # p = 0.8 gives the 80-20 split
training <- iris[a, ]
test <- iris[-a, ]
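(As a quick sanity check of the split above: an 80-20 stratified split of iris's 150 rows should give 120 training and 30 test rows.)

nrow(training)  # 120
nrow(test)      # 30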

Now, I also know that I can train the model by calling

knnFit <- train(Species ~ ., data = training, method = "knn")

However, this will result in R already performing some optimisation on the parameter k. Of course, I can limit which values of k the method should try, with something like

knnFit <- train(Species ~ ., data = training, method = "knn", tuneGrid = data.frame(k = 1:20))

which works just fine, but it still doesn't do exactly what I want it to do. For each k, this code will now:

  1. Take a bootstrap sample from the training set.
  2. Assess the performance of the k-nn method using that sample.

What I want it to do:

  1. For each k, train the model on the same training set which I constructed earlier
  2. Assess the performance **on the same test set** which I constructed earlier.

So I would need something like

knnFit <- train(Species ~ ., training_data = training, test_data = test, method = "knn", tuneGrid = data.frame(k = 1:20))

but this of course does not work.
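To make it concrete, this is roughly the manual loop I have in mind (just a sketch using caret's knn3 classifier directly; the loop structure is my own illustration, not code I want to keep):

# Train one k-nn model per k on the training set, assess each on the test set
accuracies <- sapply(1:20, function(k) {
  fit <- knn3(Species ~ ., data = training, k = k)       # train on the training set
  pred <- predict(fit, newdata = test, type = "class")   # predict on the test set
  mean(pred == test$Species)                             # test-set accuracy
})
accuracies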

I understand I should do something with the trainControl parameter, but I see its possible methods are:

"boot", "boot632", "cv", "repeatedcv", "LOOCV", "LGOCV", "none"

and none of these seems to do what I want.

Answer

thie1e · Mar 6, 2016

If I understand the question correctly, this can all be done within caret using LGOCV (leave-group-out CV, i.e. repeated train/test splits), setting the training percentage to p = 0.8 and the number of train/test splits to number = 1 if you really want just one model fit per k that is tested on a test set. Setting number > 1 will repeatedly assess model performance on number different train/test splits.

data(iris)
library(caret)
set.seed(123)
mod <- train(Species ~ ., data = iris, method = "knn",
             tuneGrid = expand.grid(k = 1:20),
             trControl = trainControl(method = "LGOCV", p = 0.8, number = 1,
                                      savePredictions = TRUE))

All predictions that the different models have made on the test set are in mod$pred if savePredictions = TRUE. Note the rowIndex column: these are the rows that were sampled into the test set. They are the same for all values of k, so the same training/test split is used every time.

> head(mod$pred)
    pred    obs rowIndex k  Resample
1 setosa setosa        5 1 Resample1
2 setosa setosa        6 1 Resample1
3 setosa setosa       10 1 Resample1
4 setosa setosa       12 1 Resample1
5 setosa setosa       16 1 Resample1
6 setosa setosa       17 1 Resample1
> tail(mod$pred)
         pred       obs rowIndex  k  Resample
595 virginica virginica      130 20 Resample1
596 virginica virginica      131 20 Resample1
597 virginica virginica      135 20 Resample1
598 virginica virginica      137 20 Resample1
599 virginica virginica      145 20 Resample1
600 virginica virginica      148 20 Resample1 
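You can check this directly on mod$pred; a quick sketch (using the mod object fitted above):

idx_by_k <- split(mod$pred$rowIndex, mod$pred$k)   # test-set rows, grouped by k
all(sapply(idx_by_k, function(z) identical(sort(z), sort(idx_by_k[[1]]))))  # should be TRUE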

There's no need to construct train/test sets manually outside of caret unless some kind of nested validation procedure is desired. You can also plot the validation curve for the different values of k with plot(mod).
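If you do want caret to reuse exactly the partition built earlier with createDataPartition, it should also be possible to hand those row numbers to train() through the index argument of trainControl; a sketch (the list element name TrainSplit is arbitrary):

a <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
mod2 <- train(Species ~ ., data = iris, method = "knn",
              tuneGrid = expand.grid(k = 1:20),
              trControl = trainControl(method = "LGOCV",
                                       index = list(TrainSplit = as.vector(a)),
                                       savePredictions = TRUE))

The accuracy for each k is then in mod2$results, and mod2$pred again holds the corresponding test-set predictions.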