I'm new to R and I'm trying to build a decision tree. I've already used the `party` package for `ctree` and the `rpart` package for `rpart`.
However, since I need to cross-validate my model, I started using the `caret` package, which lets me do that through the `train()` function with the method I want to use.
library(caret)
cvCtrl <- trainControl(method = "repeatedcv", repeats = 2,
                       classProbs = TRUE)
ctree.installed <- train(TARGET ~ OPENING_BALANCE + MONTHS_SINCE_EXPEDITION +
                           RS_DESC + SAP_STATUS + ACTIVATION_STATUS + ROTUL_STATUS +
                           SIM_STATUS + RATE_PLAN_SEGMENT_NORM,
                         data = trainSet,
                         method = "ctree",
                         trControl = cvCtrl)
However, my variables `OPENING_BALANCE` and `MONTHS_SINCE_EXPEDITION` have some missing values, and the function fails because of that. I don't understand why this happens, since I'm trying to build a tree. This problem doesn't occur when I use the other packages.
This is the error:
Error in na.fail.default(list(TARGET = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, :
missing values in object
I don't want to use `na.action = na.omit`, since I really don't want to discard those observations.
Am I doing something wrong? Why is this happening? Do you have any suggestions for this?
I'll start by considering the `PimaIndiansDiabetes2` dataset from the `mlbench` package, which has some missing values.
data(PimaIndiansDiabetes2, package = "mlbench")
head(PimaIndiansDiabetes2)
  pregnant glucose pressure triceps insulin mass pedigree age diabetes
1        6     148       72      35      NA 33.6    0.627  50      pos
2        1      85       66      29      NA 26.6    0.351  31      neg
3        8     183       64      NA      NA 23.3    0.672  32      pos
4        1      89       66      23      94 28.1    0.167  21      neg
5        0     137       40      35     168 43.1    2.288  33      pos
6        5     116       74      NA      NA 25.6    0.201  30      neg
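Before fitting, it can help to see where the missing values sit. The sketch below rebuilds only the six rows printed above as a hypothetical data frame called `pima_head` (so it runs without `mlbench`) and counts the `NA`s per column. The same two calls should work unchanged on the full `PimaIndiansDiabetes2` data.

```r
# Hypothetical stand-in: just the six rows printed by head() above.
pima_head <- data.frame(
  pregnant = c(6, 1, 8, 1, 0, 5),
  glucose  = c(148, 85, 183, 89, 137, 116),
  pressure = c(72, 66, 64, 66, 40, 74),
  triceps  = c(35, 29, NA, 23, 35, NA),
  insulin  = c(NA, NA, NA, 94, 168, NA),
  mass     = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6),
  pedigree = c(0.627, 0.351, 0.672, 0.167, 2.288, 0.201),
  age      = c(50, 31, 32, 21, 33, 30),
  diabetes = factor(c("pos", "neg", "pos", "neg", "pos", "neg"))
)

# Number of missing values in each column.
na_counts <- colSums(is.na(pima_head))
print(na_counts)

# Number of rows without any missing value.
n_complete <- sum(complete.cases(pima_head))
print(n_complete)
```

In these six rows only `triceps` and `insulin` contain `NA`s, so dropping incomplete rows would already discard most of them.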
In `train` I set `na.action` to `na.pass` (which returns the dataset unchanged) and then set the `maxsurrogate` parameter of `ctree` through `ctree_control`:
library(caret)
library(party)  # provides ctree_control()
cvCtrl <- trainControl(method = "repeatedcv", repeats = 2, classProbs = TRUE)
set.seed(1234)
ctree1 <- train(diabetes ~ ., data = PimaIndiansDiabetes2,
                method = "ctree",
                na.action = na.pass,  # keep rows with missing values
                trControl = cvCtrl,
                controls = ctree_control(maxsurrogate = 2))
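For context on the `na.action` argument: base R ships several handlers, and the `na.fail.default` error in the question comes from `na.fail`. A minimal sketch on a hypothetical toy data frame `toy` (not the data from the question) shows how the three common handlers differ:

```r
# Hypothetical toy data with one NA in each column.
toy <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))

# na.pass returns the data unchanged, NAs included.
passed <- na.pass(toy)

# na.omit drops every row that contains at least one NA.
omitted <- na.omit(toy)

# na.fail stops with an error as soon as any NA is present;
# tryCatch captures the message instead of aborting the script.
failed <- tryCatch(na.fail(toy), error = function(e) conditionMessage(e))

print(nrow(passed))   # all rows kept
print(nrow(omitted))  # only complete rows kept
print(failed)         # the error message raised by na.fail
```

Note that `na.pass` only helps when the underlying model can cope with `NA`s on its own, which `ctree` can do with surrogate splits.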
The result of the `train()` call is:
print(ctree1)
Conditional Inference Tree
392 samples
8 predictor
2 classes: 'neg', 'pos'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 2 times)
Summary of sample sizes: 691, 692, 691, 691, 691, 691, ...
Resampling results across tuning parameters:
mincriterion Accuracy Kappa
0.01 0.7349111 0.4044195
0.50 0.7485731 0.4412557
0.99 0.7323906 0.3921662
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mincriterion = 0.5.
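The "Summary of sample sizes" line suggests that the rows with missing values stayed in the resampling rather than being dropped. A rough sketch of the arithmetic, assuming the full dataset has 768 rows (a figure not shown in the output above) and using plain round-robin fold assignment instead of caret's own stratified splitting:

```r
n_rows  <- 768  # assumed number of rows in PimaIndiansDiabetes2
n_folds <- 10   # folds reported in the "Resampling" line

# Assign each row to a fold in turn: 1, 2, ..., 10, 1, 2, ...
fold_id <- rep(seq_len(n_folds), length.out = n_rows)

# Rows held out in each fold, and rows left for training.
held_out    <- as.integer(table(fold_id))
train_sizes <- n_rows - held_out
print(train_sizes)
```

Under these assumptions every training set has 691 or 692 rows, which is in line with the sizes reported by `print(ctree1)`.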