Missing values error in caret's train() function for trees

Carolina Leana Santos · Apr 27, 2017

I'm new to R and I'm trying to build a decision tree. I've already used the party package for ctree and the rpart package for rpart.

However, since I need to cross-validate my model, I started using the caret package, which lets me do that through the `train()` function with the method I want to use.

library(caret)
cvCtrl <- trainControl(method = "repeatedcv", repeats = 2,
                       classProbs = TRUE)

ctree.installed <- train(TARGET ~ OPENING_BALANCE + MONTHS_SINCE_EXPEDITION +
                           RS_DESC + SAP_STATUS + ACTIVATION_STATUS + ROTUL_STATUS +
                           SIM_STATUS + RATE_PLAN_SEGMENT_NORM,
                         data = trainSet,
                         method = "ctree",
                         trControl = cvCtrl)

However, my variables OPENING_BALANCE and MONTHS_SINCE_EXPEDITION have some missing values, and the function fails because of them. I don't understand why this happens, since I'm building a tree. This problem doesn't occur when I use the other packages.

This is the error:

Error in na.fail.default(list(TARGET = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,  : 
missing values in object

I didn't want to use na.action=pass since I really don't want to discard those observations.

Am I doing something wrong? Why is this happening? Do you have any suggestions for this?

Answer

Marco Sandri · Apr 27, 2017

I'll start with the PimaIndiansDiabetes2 dataset from the mlbench package, which has some missing values.

data(PimaIndiansDiabetes2, package = "mlbench")
head(PimaIndiansDiabetes2)

  pregnant glucose pressure triceps insulin mass pedigree age diabetes
1        6     148       72      35      NA 33.6    0.627  50      pos
2        1      85       66      29      NA 26.6    0.351  31      neg
3        8     183       64      NA      NA 23.3    0.672  32      pos
4        1      89       66      23      94 28.1    0.167  21      neg
5        0     137       40      35     168 43.1    2.288  33      pos
6        5     116       74      NA      NA 25.6    0.201  30      neg
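
Before fitting, it helps to see where the missing values sit. The following is a minimal sketch that rebuilds only the six rows printed above in a data frame I call `pima_head` (a name introduced here for illustration) and counts the NAs per column. On the full dataset the same two calls would be applied to PimaIndiansDiabetes2 directly.

```r
# The six rows shown by head() above, retyped so the sketch is self-contained
pima_head <- data.frame(
  pregnant = c(6, 1, 8, 1, 0, 5),
  glucose  = c(148, 85, 183, 89, 137, 116),
  pressure = c(72, 66, 64, 66, 40, 74),
  triceps  = c(35, 29, NA, 23, 35, NA),
  insulin  = c(NA, NA, NA, 94, 168, NA),
  mass     = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6),
  pedigree = c(0.627, 0.351, 0.672, 0.167, 2.288, 0.201),
  age      = c(50, 31, 32, 21, 33, 30),
  diabetes = factor(c("pos", "neg", "pos", "neg", "pos", "neg"))
)

# Number of missing values in each column
na_per_column <- colSums(is.na(pima_head))
print(na_per_column)

# Number of rows with no missing value at all
n_complete <- sum(complete.cases(pima_head))
print(n_complete)
```

In these six rows only triceps and insulin contain NAs, and just two rows are complete, which is why dropping incomplete rows would throw away a lot of data.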

In train I set na.action to na.pass (which returns the data unchanged, missing values included) and then set the maxsurrogate parameter of ctree through ctree_control:

library(caret)
library(party)  # provides ctree_control()
cvCtrl <- trainControl(method = "repeatedcv", repeats = 2, classProbs = TRUE)
set.seed(1234)
ctree1 <- train(diabetes ~ ., data = PimaIndiansDiabetes2,
                method = "ctree",
                na.action = na.pass,  # keep rows with missing values
                trControl = cvCtrl,
                controls = ctree_control(maxsurrogate = 2))  # allow surrogate splits

The result is:

print(ctree1)
Conditional Inference Tree 

392 samples
  8 predictor
  2 classes: 'neg', 'pos' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 2 times) 
Summary of sample sizes: 691, 692, 691, 691, 691, 691, ... 
Resampling results across tuning parameters:

  mincriterion  Accuracy   Kappa    
  0.01          0.7349111  0.4044195
  0.50          0.7485731  0.4412557
  0.99          0.7323906  0.3921662

Accuracy was used to select the optimal model using  the largest value.
The final value used for the model was mincriterion = 0.5.
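
As for why the original call failed: the formula method of train() uses na.action = na.fail by default, which is the function named in the error message. Here is a small base R sketch of the three common na.action choices, using a made-up data frame `toy` (not the asker's data) so it runs without any extra packages.

```r
# Hypothetical toy data with one missing predictor value
toy <- data.frame(y = factor(c("a", "b", "a", "b")),
                  x = c(1.5, NA, 3.2, 4.1))

# na.fail stops with an error as soon as it finds an NA;
# tryCatch captures the error message instead of aborting
failed <- tryCatch(na.fail(toy), error = function(e) conditionMessage(e))
print(failed)

# na.omit silently drops every row that contains an NA
kept <- na.omit(toy)
print(nrow(kept))

# na.pass hands the data through untouched, NAs included
passed <- na.pass(toy)
print(identical(passed, toy))
```

With na.pass the missing values reach ctree itself, which is where the surrogate splits set by maxsurrogate come in.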