I have researched this extensively without finding a solution. I have cleaned my data set as follows:
library("raster")
impute.mean <- function(x) replace(x, is.na(x) | is.nan(x) | is.infinite(x) ,
mean(x, na.rm = TRUE))
losses <- apply(losses, 2, impute.mean)
colSums(is.na(losses))
isinf <- function(x) (NA <- is.infinite(x))
infout <- apply(losses, 2, is.infinite)
colSums(infout)
isnan <- function(x) (NA <- is.nan(x))
nanout <- apply(losses, 2, is.nan)
colSums(nanout)
The problem arises when I run predict:
options(warn=2)
p <- predict(default.rf, losses, type="prob", inf.rm = TRUE, na.rm=TRUE, nan.rm=TRUE)
Everything I have read says the cause should be NAs, Infs, or NaNs in the data, but I can't find any. I am making the data and the randomForest summary available for sleuthing at [deleted]
The traceback doesn't reveal much (to me, anyway):
4: .C("classForest", mdim = as.integer(mdim), ntest = as.integer(ntest),
nclass = as.integer(object$forest$nclass), maxcat = as.integer(maxcat),
nrnodes = as.integer(nrnodes), jbt = as.integer(ntree), xts = as.double(x),
xbestsplit = as.double(object$forest$xbestsplit), pid = object$forest$pid,
cutoff = as.double(cutoff), countts = as.double(countts),
treemap = as.integer(aperm(object$forest$treemap, c(2, 1,
3))), nodestatus = as.integer(object$forest$nodestatus),
cat = as.integer(object$forest$ncat), nodepred = as.integer(object$forest$nodepred),
treepred = as.integer(treepred), jet = as.integer(numeric(ntest)),
bestvar = as.integer(object$forest$bestvar), nodexts = as.integer(nodexts),
ndbigtree = as.integer(object$forest$ndbigtree), predict.all = as.integer(predict.all),
prox = as.integer(proximity), proxmatrix = as.double(proxmatrix),
nodes = as.integer(nodes), DUP = FALSE, PACKAGE = "randomForest")
3: predict.randomForest(default.rf, losses, type = "prob", inf.rm = TRUE,
na.rm = TRUE, nan.rm = TRUE)
2: predict(default.rf, losses, type = "prob", inf.rm = TRUE, na.rm = TRUE,
nan.rm = TRUE)
1: predict(default.rf, losses, type = "prob", inf.rm = TRUE, na.rm = TRUE,
nan.rm = TRUE)
Your code is not entirely reproducible (it never actually runs the randomForest algorithm), but the problem is that you are not replacing Inf values with the finite means of the column vectors. The na.rm = TRUE argument in the call to mean() within your impute.mean function does exactly what it says: it removes NA values (and NaN), but not Inf ones. Any column that contains an Inf therefore has a mean of Inf (or NaN if it holds both Inf and -Inf), and that value is what gets written back.
You can see this, for example, by:
impute.mean <- function(x) replace(x, is.na(x) | is.nan(x) | is.infinite(x), mean(x, na.rm = TRUE))
losses <- apply(losses, 2, impute.mean)
sum( apply( losses, 2, function(.) sum(is.infinite(.))) )
# [1] 696
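The same effect can be reproduced without the original data. The following is a minimal sketch on a made-up vector x (an assumption for illustration, not taken from the question's data set):

```r
# Toy vector (hypothetical data) containing NA, NaN, and Inf
x <- c(1, 2, NA, NaN, Inf)

# na.rm = TRUE drops NA and NaN, but Inf stays in the sum, so the mean is Inf
m <- mean(x, na.rm = TRUE)
is.infinite(m)

# The impute.mean from the question therefore writes Inf into every flagged position
impute.mean <- function(x) replace(x, is.na(x) | is.nan(x) | is.infinite(x), mean(x, na.rm = TRUE))
imputed <- impute.mean(x)
sum(is.infinite(imputed))
```

Note that in this sketch the NA and NaN entries also end up as Inf, so the imputation can increase the number of infinite values rather than reduce it.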
To get rid of infinite values, use:
impute.mean <- function(x) replace(x, is.na(x) | is.nan(x) | is.infinite(x), mean(x[!is.na(x) & !is.nan(x) & !is.infinite(x)]))
losses <- apply(losses, 2, impute.mean)
sum(apply( losses, 2, function(.) sum(is.infinite(.)) ))
# [1] 0
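An equivalent and somewhat shorter way to write this relies on is.finite(), which is FALSE for NA, NaN, Inf, and -Inf alike. The sketch below assumes numeric columns; the helper name impute.finite.mean and the toy matrix are made up for illustration:

```r
# Hypothetical helper: replace every non-finite entry with the mean of the finite ones
impute.finite.mean <- function(x) {
  ok <- is.finite(x)          # FALSE for NA, NaN, Inf and -Inf
  replace(x, !ok, mean(x[ok]))
}

# Toy matrix (made-up data) standing in for losses
m <- cbind(a = c(1, 3, NA, Inf), b = c(2, NaN, -Inf, 4))
cleaned <- apply(m, 2, impute.finite.mean)

# Count of remaining non-finite entries; should be zero for this toy input
sum(!is.finite(cleaned))
```

One caveat: if a column has no finite values at all, mean() of an empty vector is NaN, so such columns would need to be dropped or handled separately before calling predict.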