I have been struggling for several days to fit a classification tree with the caret package. The problem is my factor variables. I can generate the tree, but when I try to use the best model to make predictions on the test sample, it fails: the train function creates dummy variables for my factors, and the predict function then cannot find these newly created dummies in the test set. How should I deal with this problem?
My code is as follows:
install.packages("caret", dependencies = c("Depends", "Suggests"))
library(caret)
db <- read.csv("db.csv", header = TRUE, sep = ";", na.strings = "?")
fix(db)
db$defaillance=factor(db$defaillance)
db$def=ifelse(db$defaillance==0,"No","Yes")
db$def=factor(db$def)
db$defaillance=NULL
db$canal=factor(db$canal)
db$sect_isodev=factor(db$sect_isodev)
db$sect_risq=factor(db$sect_risq)
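# drop near-zero-variance predictors (column 78 is excluded from the check)
nzv <- nearZeroVar(db[,-78])
# guard: with an empty nzv, db[,-nzv] would drop every column
db_new <- if (length(nzv) > 0) db[,-nzv] else db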
inTrain <- createDataPartition(y = db_new$def, p = .75, list = FALSE)
training <- db_new[inTrain,]
testing <- db_new[-inTrain,]
str(training)
str(testing)
dim(training)
dim(testing)
A sample of the str() output for training/testing is shown below:
$ FDR : num 1305 211 162 131 143 ...
$ FCYC : num 0.269 0.18 0.154 0.119 0.139 ...
$ BFDR : num 803 164 108 72 76 63 100 152 188 80 ...
$ TRES : num 502 47 54 59 67 49 53 -7 -103 -109 ...
$ sect_isodev: Factor w/ 9 levels "1","2","3","4",..: 4 3 3 3 3 3 3 3 3 3 ...
$ sect_risq : Factor w/ 6 levels "0","1","2","3",..: 6 6 6 6 6 6 6 6 6 6 ...
$ def : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
> dim(training)
[1] 14553 42
> dim(testing)
[1] 4850 42
Then my code goes like this:
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 10,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)
#CART1
set.seed(1234)
tree1 <- train(def ~ .,
               data = training,
               method = "rpart",
               tuneLength = 20,
               metric = "ROC",
               trControl = fitControl)
A sample of the output of summary(tree1$finalModel) is here:
RNTB 38.397731
sect_isodev1 6.742289
sect_isodev3 4.005016
sect_isodev8 2.520850
sect_risq3 9.909127
sect_risq4 6.737908
sect_risq5 3.085714
SOLV 73.067539
TRES 47.906884
sect_isodev2 0.000000
sect_isodev4 0.000000
sect_isodev5 0.000000
sect_isodev6 0.000000
sect_isodev7 0.000000
sect_isodev9 0.000000
sect_risq0 0.000000
sect_risq1 0.000000
sect_risq2 0.000000
And here is the error:
model.tree1 <- predict(tree1$finalModel,testing)
Error in eval(expr, envir, enclos) : object 'sect_isodev1' not found
I am also curious about another thing. In Max Kuhn's "Predictive Modelling with R" I found the following syntax:
predict(rpartTune$finalModel, newdata, type = "class")
where rpartTune$finalModel is a classification tree identical to mine (or mine identical to his).
However, R doesn't accept type = "class" for me, only type = "prob", and that troubles me.
Thank you in advance for your responses.
Don't use predict.rpart with the train$finalModel unless you have a really good reason. The rpart object doesn't know about anything that train did, including pre-processing. It may not give you the correct answer. After all, you might be using train in order to avoid the minutiae, so let predict.train do the work.
Max
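As an illustration of the advice above, here is a minimal sketch on made-up data. The data frame toy and its columns x1, sect and def are hypothetical stand-ins, not the asker's data. It calls predict on the train object itself, so caret rebuilds the dummy columns for the factor before rpart sees the new data; the same call on the finalModel is expected to fail with an "object not found" error like the one in the question.

```r
library(caret)

set.seed(1234)
n <- 400
# Hypothetical toy data: one numeric predictor, one factor predictor
toy <- data.frame(
  x1   = rnorm(n),
  sect = factor(sample(c("1", "2", "3"), n, replace = TRUE))
)
toy$def <- factor(ifelse(toy$x1 + (toy$sect == "3") + rnorm(n) > 0.5, "Yes", "No"))

inTrain  <- createDataPartition(y = toy$def, p = .75, list = FALSE)
training <- toy[inTrain, ]
testing  <- toy[-inTrain, ]

# The formula interface expands sect into dummy columns before fitting
fit <- train(def ~ ., data = training, method = "rpart",
             tuneLength = 5,
             trControl = trainControl(method = "cv", number = 5))

# predict.train applies the same dummy expansion to the new data
pred <- predict(fit, newdata = testing)
table(predicted = pred, observed = testing$def)

# The raw rpart object only knows the dummy column names,
# so this call is expected to stop with an error
bad <- try(predict(fit$finalModel, testing), silent = TRUE)
inherits(bad, "try-error")
```

For the code in the question, the equivalent call would be predict(tree1, newdata = testing).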
EDIT -
About the type = "class" and type = "prob" bit: predict.rpart defaults to producing class probabilities. Although rpart is one of the earliest packages, that is atypical, as most produce classes by default. predict.train produces the classes by default, and you have to use type = "prob" to get probabilities.
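The difference in defaults can be seen side by side in the following sketch, again on hypothetical toy data with numeric predictors only, so that no dummy variables are involved. The object names are illustrative.

```r
library(caret)
library(rpart)

set.seed(1234)
n <- 300
# Hypothetical toy data with a two-class outcome
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$def <- factor(ifelse(toy$x1 + rnorm(n, sd = 0.5) > 0, "Yes", "No"))

fit <- train(def ~ ., data = toy, method = "rpart",
             trControl = trainControl(method = "cv", number = 5))

# predict.train: classes unless probabilities are requested
cls_train  <- predict(fit, newdata = toy)                 # factor of classes
prob_train <- predict(fit, newdata = toy, type = "prob")  # one column per class

# predict.rpart: probabilities unless classes are requested
rp <- rpart(def ~ ., data = toy)
prob_rpart <- predict(rp, newdata = toy)                  # matrix of probabilities
cls_rpart  <- predict(rp, newdata = toy, type = "class")  # factor of classes
```

Note that type = "class" is an rpart argument; on a train object the corresponding default is type = "raw".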