Creating data partition in R

Aiden · Jul 20, 2016 · Viewed 26.1k times

With the caret package, a data partition of 75% training and 25% test is created with:

inTrain <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)

Note: the dataset is named spam and the target variable is named type.

My question is: what is the purpose of including the y = spam$type argument?

Isn’t the purpose of creating data partitions simply to split the entire data set based on the proportion you require for training vs. testing? Why does that argument need to be included in the code?

Answer

Imran Ali · Jul 20, 2016

I am assuming that the createDataPartition() in question is the one from the caret package.

If spam$type is a factor, which is generally the case, the random sampling occurs within each class, so the training set keeps roughly the same class distribution as the full data set.
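The idea of sampling within each class can be sketched in base R. This is a rough illustration under stated assumptions, not caret's actual implementation: the label vector y is made-up data, and rounding each per-class sample size up with ceiling() is a choice made for this sketch.

```r
# Hypothetical imbalanced outcome: 90 "nonspam" and 10 "spam" labels (made-up data).
y <- factor(rep(c("nonspam", "spam"), times = c(90, 10)))

set.seed(1)  # arbitrary seed, only for reproducibility

# Simple random sample of 75% of all rows: the class mix can drift by chance.
simple_idx <- sample(seq_along(y), size = 0.75 * length(y))

# Stratified sample: draw 75% of the rows separately within each class.
# (Rounding up with ceiling() is an assumption of this sketch.)
stratified_idx <- unlist(lapply(split(seq_along(y), y), function(rows) {
  rows[sample.int(length(rows), size = ceiling(0.75 * length(rows)))]
}))

table(y[simple_idx])      # spam count depends on the random draw
table(y[stratified_idx])  # always 68 nonspam and 8 spam
```

In the simple random sample the number of spam rows depends on the draw, whereas the stratified sample fixes it by construction. That is why the function needs to see the outcome variable.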

To illustrate, suppose we partition the iris data set in the same proportion as in your question.

data(iris)
summary(iris)

Note the count listed for each species (50 apiece). Now run the following:

library(caret)
inTrain <- createDataPartition(y = iris$Species, p = 0.75, list = FALSE)

inTrain holds the row indices of approximately 75% of the rows of each species, which can be verified with the following command:

summary(iris[inTrain, ])

There are 50 observations of each species, and 38 of them (approximately 75%) have been randomly selected for the training data set.
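As a minimal sketch of how the partition is typically used, the indices can be applied to build both sets, and the per-species counts checked with table(). The seed value and the names training and testing are choices made for this illustration and do not come from the question.

```r
library(caret)

set.seed(123)  # arbitrary seed so the split is reproducible
inTrain <- createDataPartition(y = iris$Species, p = 0.75, list = FALSE)

training <- iris[inTrain, ]   # rows selected by the partition
testing  <- iris[-inTrain, ]  # all remaining rows

table(training$Species)  # per-species counts in the training set
table(testing$Species)   # per-species counts in the test set
```

Because the sampling is done within each species, both tables should show the same count for every species, whatever seed is used.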