R - predict command error "undefined columns selected"

user1907117 · Dec 16, 2012

I’m a newbie to R, and I’m having trouble with an R predict command. I receive this error

 Error in `[.data.frame`(newdata, , as.character(object$formula[[2]])) : 
  undefined columns selected

when I execute this command:

model.predict <- predict.boosting(model,newdata=test)

Here is my model:

model <- boosting(Y~x1+x2+x3+x4+x5+x6+x7, data=train)

And here is the structure of my test data: str(test)

'data.frame':   343 obs. of  7 variables:
 $ x1: Factor w/ 4 levels "Americas","Asia_Pac",..: 4 2 4 2 4 3 3 3 4 1 ...
 $ x2: Factor w/ 5 levels "Fifth","First",..: 3 3 2 2 4 2 4 4 1 1 ...
 $ x3: Factor w/ 3 levels "Best","Better",..: 2 3 1 1 3 2 2 1 3 3 ...
 $ x4: Factor w/ 2 levels "Female","Male": 1 1 2 1 1 2 1 2 2 2 ...
 $ x5: int  82 55 47 31 6 53 77 68 76 86 ...
 $ x6: num  22.8 14.6 25.5 38.3 7.9 32.8 4.6 34.2 36.7 21.7 ...
 $ x7: num  0.679 0.925 0.897 0.684 0.195 ...

And the structure of my training data:

$ RecordID: int  1 2 3 4 5 6 7 8 9 10 ...
 $ x1      : Factor w/ 4 levels "Americas","Asia_Pac",..: 1 2 2 3 1 1 1 2 2 4 ...
 $ x2      : Factor w/ 5 levels "Fifth","First",..: 5 5 3 2 5 5 5 4 3 2 ...
 $ x3      : Factor w/ 3 levels "Best","Better",..: 2 3 2 2 3 1 2 3 1 1 ...
 $ x4      : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 2 1 1 ...
 $ x5      : int  1 67 75 51 84 33 21 80 48 5 ...
 $ x6      : num  21 13.8 30.3 11.9 1.7 13.2 33.9 17 3.4 19.5 ...
 $ x7      : num  0.35 0.85 0.73 0.39 0.47 0.13 0.2 0.12 0.64 0.11 ...
 $ Y       : Factor w/ 2 levels "Green","Yellow": 2 2 1 2 2 2 1 2 2 2 ..

I think there’s a problem with the structure of the test data, but I can’t find it, or else I have misunderstood how the “predict” command works. Note that if I run the predict command on the training data, it works. Any suggestions as to where to look?

Thanks!

Answer

MattBagg · Dec 16, 2012

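predict.boosting() expects newdata to include the actual labels for the test data, so that it can calculate how well it did (as in the confusion matrix shown below). Your test data frame has no Y column, which is why the column lookup in the error message fails. It also explains why predicting on the training data, which does contain Y, works.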

library(adabag) 

data(iris)

iris.adaboost <- boosting(Species~Sepal.Length+Sepal.Width+Petal.Length+
      Petal.Width, data=iris, boos=TRUE, mfinal=10)

# make a 'test' dataframe without the classes, as in the question
iris2 <- iris
iris2$Species <- NULL

# replicates the error
irispred <- predict.boosting(iris.adaboost, newdata=iris2)
#Error in `[.data.frame`(newdata, , as.character(object$formula[[2]])) : 
#  undefined columns selected
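
To check whether a missing response column is what triggers the error, you can compare the variables named in the formula against the columns of the data frame you pass as newdata. The sketch below uses only base R. The formula and the toy data frame are hypothetical stand-ins for the question's objects; with a fitted model, the formula appears to be stored in model$formula, judging by the object$formula lookup in the error message.

```r
# Hypothetical stand-ins for the question's formula and test data
# (toy values, not the asker's real data).
model_formula <- Y ~ x1 + x2 + x3
test <- data.frame(x1 = c(1, 2), x2 = c(3, 4), x3 = c(5, 6))

# Name of the response variable: the same lookup that appears in the error message.
response <- as.character(model_formula[[2]])

# Formula variables that are absent from the data frame passed as newdata.
missing_cols <- setdiff(all.vars(model_formula), names(test))

response      # "Y"
missing_cols  # "Y"
```

If missing_cols contains the response name, the test data lacks the label column. For a formula written with a dot, such as Species ~ ., all.vars() also returns ".", so in that case only the check on response is meaningful.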

Here's a working example, drawn largely from the help file, which also demonstrates the confusion matrix.

# first create subsets of iris data for training and testing  
sub <- c(sample(1:50, 25), sample(51:100, 25), sample(101:150, 25))
iris3 <- iris[sub,]
iris4 <- iris[-sub,]

iris.adaboost <- boosting(Species ~ ., data=iris3, mfinal=10)

# works
iris.predboosting <- predict.boosting(iris.adaboost, newdata=iris4)

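iris.predboosting$confusion
# Note: iris4 holds 25 rows per species, so running the code above gives
# columns that sum to 25 (exact counts depend on the random split). The 50s
# shown here appear to come from a prediction over all 150 rows of iris.
#               Observed Class
#Predicted Class setosa versicolor virginica
#     setosa         50          0         0
#     versicolor      0         50         0
#     virginica       0          0        50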
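
If the true labels for the test data are genuinely unknown, one possible workaround is to add a placeholder response column before predicting and then read only the predicted classes. This is a sketch under assumptions, not something taken from the adabag documentation: add_placeholder_response is a hypothetical helper, the toy data frame stands in for the question's test data, and I am assuming the placeholder must be a factor with the same levels as the training response. I have not checked this against every adabag release, and newer versions may handle unlabelled newdata differently.

```r
# Hypothetical helper (not part of adabag): add a placeholder response column
# so that newdata contains every column named in the model formula.
add_placeholder_response <- function(newdata, response, response_levels) {
  # Every row gets the first level. The values are dummies; the assumption is
  # that only the column name and the factor levels matter for prediction.
  newdata[[response]] <- factor(
    rep(response_levels[1], nrow(newdata)),
    levels = response_levels
  )
  newdata
}

# Toy stand-in for the question's unlabelled test data.
test <- data.frame(x5 = c(82L, 55L, 47L), x6 = c(22.8, 14.6, 25.5))
test_labelled <- add_placeholder_response(test, "Y", c("Green", "Yellow"))
str(test_labelled)

# With a fitted adabag model (assumed to be called `model`, as in the question):
# pred <- predict.boosting(model, newdata = test_labelled)
# pred$class   # predicted labels; ignore pred$confusion and pred$error here
```

Because the placeholder labels are made up, the confusion matrix and error rate returned alongside the predictions would be meaningless in this case; only the predicted classes are of interest.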