Combining random forests built with different training sets in R

josh · Oct 4, 2013 · Viewed 9k times

I am new to R (day 2) and have been tasked with building a forest of random forests. Each individual random forest will be built using a different training set and we will combine all the forests at the end to make predictions. I am implementing this in R and am having some difficulty combining two forests not built using the same set. My attempt is as follows:

library(randomForest)

d1 = read.csv("../data/rr/train/10/chunk0.csv",header=TRUE)
d2 = read.csv("../data/rr/train/10/chunk1.csv",header=TRUE)

rf1 = randomForest(A55~., data=d1, ntree=10)
rf2 = randomForest(A55~., data=d2, ntree=10)

rf = combine(rf1,rf2)

This of course produces an error:

Error in rf$votes + ifelse(is.na(rflist[[i]]$votes), 0, rflist[[i]]$votes) : 
non-conformable arrays
In addition: Warning message:
In rf$oob.times + rflist[[i]]$oob.times :
longer object length is not a multiple of shorter object length

I have been browsing the web for some time looking for a clue about this but haven't had any success yet. Any help here would be most appreciated.

Answer

joran · Oct 4, 2013

Ah. This is either an oversight in combine or what you're trying to do is nonsensical, depending on your point of view.

The votes matrix records the number of votes in the forest for each case in the training data for each response category. Naturally, it will have the same number of rows as the number of rows in your training data.
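To make the shape issue concrete, here is a minimal sketch using the built-in iris data (the split sizes, seed, and object names rf_a and rf_b are only illustrative, not from the question). It shows that each forest's votes matrix has one row per training case, so forests trained on 70 and 80 rows carry matrices of different shapes:

```r
library(randomForest)

set.seed(1)                          # illustrative seed, for reproducibility only
data(iris)
d <- iris[sample(nrow(iris)), ]      # shuffle so both chunks contain all three species

rf_a <- randomForest(Species ~ ., data = d[1:70, ],   ntree = 50)
rf_b <- randomForest(Species ~ ., data = d[71:150, ], ntree = 50)

dim(rf_a$votes)   # one row per training case (70), one column per class (3)
dim(rf_b$votes)   # 80 rows here, so the two matrices cannot be added element-wise
```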

combine is assuming that you ran your random forests twice on the same set of data, so the dimensions of those matrices will be the same. It's doing this because it wants to provide you with some "overall" error estimates for the combined forest.

But if the two data sets are different, combining the votes matrices becomes simply nonsensical. You could get combine to run by simply removing one row from your larger training data set, but the resulting votes matrix in the combined forest would be gibberish, since each row would be a combination of votes for two different training cases.

So maybe this is simply something that should be an option that can be turned off in combine, because it should still make sense to combine the actual trees and predict on the resulting object. But some of the "combined" error estimates in the output from combine will be meaningless.
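If all you need is predictions, one way to sidestep combine altogether is to predict with each forest separately and add up the per-class vote counts. This is a different technique from merging the forest objects, sketched here on iris with illustrative names (v1, v2, votes, pred); it should agree with a majority vote over all the trees, though ties may be broken differently:

```r
library(randomForest)

set.seed(1)                          # illustrative seed
data(iris)
d  <- iris[sample(nrow(iris)), ]
d1 <- d[1:70, ]
d2 <- d[71:150, ]

rf1 <- randomForest(Species ~ ., data = d1, ntree = 50)
rf2 <- randomForest(Species ~ ., data = d2, ntree = 50)

# Raw (unnormalised) vote counts: one row per new case, one column per class
v1 <- predict(rf1, newdata = iris, type = "vote", norm.votes = FALSE)
v2 <- predict(rf2, newdata = iris, type = "vote", norm.votes = FALSE)

# Pool the votes of all 100 trees and pick the class with the most votes
votes <- v1 + v2
pred  <- factor(colnames(votes)[max.col(votes, ties.method = "first")],
                levels = levels(iris$Species))

table(pred, iris$Species)
```

Because the counts are summed before taking the maximum, forests with different numbers of trees are weighted by their tree counts, just as they would be in a single merged forest.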

Long story short, make each training data set the same size, and it will run. But if you do, I wouldn't use the resulting object for anything other than making new predictions. Anything that is combined that was summarizing the performance of the forests will be nonsense.

However, I think the intended way to use combine is to fit multiple random forests on the full data set, but with a reduced number of trees and then to combine those forests.
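For comparison, here is a minimal sketch of that usage on iris (object names are illustrative): every forest sees the same full data set, so the per-case summaries line up and combine runs without complaint.

```r
library(randomForest)

set.seed(1)                          # illustrative seed
data(iris)

# Three small forests, all trained on the same full data set
rf_x <- randomForest(Species ~ ., data = iris, ntree = 50, norm.votes = FALSE)
rf_y <- randomForest(Species ~ ., data = iris, ntree = 50, norm.votes = FALSE)
rf_z <- randomForest(Species ~ ., data = iris, ntree = 50, norm.votes = FALSE)

rf_all <- combine(rf_x, rf_y, rf_z)
rf_all$ntree                         # total number of trees across the three forests
```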

Edit

I went ahead and modified combine to "handle" unequal training set sizes. All that means really is that I removed a large chunk of code that was trying to stitch things together that weren't going to match up. But I kept the portion that combines the forests, so you can still use predict:

my_combine <- function (...) 
{
    pad0 <- function(x, len) c(x, rep(0, len - length(x)))
    padm0 <- function(x, len) rbind(x, matrix(0, nrow = len - 
        nrow(x), ncol = ncol(x)))
    rflist <- list(...)
    areForest <- sapply(rflist, function(x) inherits(x, "randomForest"))
    if (any(!areForest)) 
        stop("Argument must be a list of randomForest objects")
    rf <- rflist[[1]]
    classRF <- rf$type == "classification"
    trees <- sapply(rflist, function(x) x$ntree)
    ntree <- sum(trees)
    rf$ntree <- ntree
    nforest <- length(rflist)
    haveTest <- !any(sapply(rflist, function(x) is.null(x$test)))
    vlist <- lapply(rflist, function(x) rownames(importance(x)))
    numvars <- sapply(vlist, length)
    if (!all(numvars[1] == numvars[-1])) 
        stop("Unequal number of predictor variables in the randomForest objects.")
    for (i in seq_along(vlist)) {
        if (!all(vlist[[i]] == vlist[[1]])) 
            stop("Predictor variables are different in the randomForest objects.")
    }
    haveForest <- sapply(rflist, function(x) !is.null(x$forest))
    if (all(haveForest)) {
        nrnodes <- max(sapply(rflist, function(x) x$forest$nrnodes))
        rf$forest$nrnodes <- nrnodes
        rf$forest$ndbigtree <- unlist(sapply(rflist, function(x) x$forest$ndbigtree))
        rf$forest$nodestatus <- do.call("cbind", lapply(rflist, 
            function(x) padm0(x$forest$nodestatus, nrnodes)))
        rf$forest$bestvar <- do.call("cbind", lapply(rflist, 
            function(x) padm0(x$forest$bestvar, nrnodes)))
        rf$forest$xbestsplit <- do.call("cbind", lapply(rflist, 
            function(x) padm0(x$forest$xbestsplit, nrnodes)))
        rf$forest$nodepred <- do.call("cbind", lapply(rflist, 
            function(x) padm0(x$forest$nodepred, nrnodes)))
        tree.dim <- dim(rf$forest$treemap)
        if (classRF) {
            rf$forest$treemap <- array(unlist(lapply(rflist, 
                function(x) apply(x$forest$treemap, 2:3, pad0, 
                  nrnodes))), c(nrnodes, 2, ntree))
        }
        else {
            rf$forest$leftDaughter <- do.call("cbind", lapply(rflist, 
                function(x) padm0(x$forest$leftDaughter, nrnodes)))
            rf$forest$rightDaughter <- do.call("cbind", lapply(rflist, 
                function(x) padm0(x$forest$rightDaughter, nrnodes)))
        }
        rf$forest$ntree <- ntree
        if (classRF) 
            rf$forest$cutoff <- rflist[[1]]$forest$cutoff
    }
    else {
        rf$forest <- NULL
    }
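    #
    # The code that combine() uses to stitch together per-case summaries
    # (votes, oob.times, and the like) was removed here. Those components
    # of the returned object still come from the first forest only, so
    # ignore them.
    #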
    if (classRF) {
        rf$confusion <- NULL
        rf$err.rate <- NULL
        if (haveTest) {
            rf$test$confusion <- NULL
            rf$test$err.rate <- NULL
        }
    }
    else {
        rf$mse <- rf$rsq <- NULL
        if (haveTest) 
            rf$test$mse <- rf$test$rsq <- NULL
    }
    rf
}

And then you can test it like this:

library(randomForest)
data(iris)
d <- iris[sample(150,150),]
d1 <- d[1:70,]
d2 <- d[71:150,]
rf1 <- randomForest(Species ~ ., d1, ntree=50, norm.votes=FALSE)
rf2 <- randomForest(Species ~ ., d2, ntree=50, norm.votes=FALSE)

rf.all <- my_combine(rf1,rf2)
predict(rf.all,newdata = iris)

Obviously, this comes with absolutely no warranty! :)