I'm looking for something similar to na.locf() in the zoo package, but instead of always using the previous non-NA value I'd like to use the nearest non-NA value. Some example data:
dat <- c(1, 3, NA, NA, 5, 7)
Replacing NA with na.locf (3 is carried forward):
library(zoo)
na.locf(dat)
# 1 3 3 3 5 7
and na.locf with fromLast set to TRUE (5 is carried backwards):
na.locf(dat, fromLast = TRUE)
# 1 3 5 5 5 7
But I would like the nearest non-NA value to be used. In my example this means that the 3 should be carried forward to the first NA, and the 5 should be carried backwards to the second NA:
1 3 3 5 5 7
I have a solution coded up, but wanted to make sure that I wasn't reinventing the wheel. Is there something already floating around?
FYI, my current code is as follows. Perhaps if nothing else, someone can suggest how to make it more efficient. I feel like I'm missing an obvious way to improve this:
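f0 <- function(dat) {
  na.pos <- which(is.na(dat))
  # nothing to fill, or nothing to fill from
  if (length(na.pos) %in% c(0, length(dat))) {
    return(dat)
  }
  non.na.pos <- setdiff(seq_along(dat), na.pos)
  # for each NA, the index (into non.na.pos) of the closest non-NA position
  nearest.non.na.pos <- sapply(na.pos, function(x) {
    return(which.min(abs(non.na.pos - x)))
  })
  dat[na.pos] <- dat[non.na.pos[nearest.non.na.pos]]
  return(dat)
}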
To answer smci's questions below:
Update: It turns out that we're going in a different direction altogether, but this was still an interesting discussion. Thanks all!
Here is a very fast one. It uses findInterval to find which two positions (the non-NA neighbours on either side) should be considered for each NA in your original data:
f1 <- function(dat) {
  N <- length(dat)
  na.pos <- which(is.na(dat))
  if (length(na.pos) %in% c(0, N)) {
    return(dat)
  }
  non.na.pos <- which(!is.na(dat))
  # interval of non-NA positions that each NA falls into (clamped by all.inside)
  intervals <- findInterval(na.pos, non.na.pos,
                            all.inside = TRUE)
  left.pos <- non.na.pos[pmax(1, intervals)]
  # clamp to the number of non-NA positions, not to length(dat)
  right.pos <- non.na.pos[pmin(length(non.na.pos), intervals + 1)]
  left.dist <- na.pos - left.pos
  right.dist <- right.pos - na.pos
  # ties go to the left neighbour
  dat[na.pos] <- ifelse(left.dist <= right.dist,
                        dat[left.pos], dat[right.pos])
  return(dat)
}
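To make the steps concrete, here is a small sketch that traces the intermediate values on the example vector from the question. The variable name filled is only for illustration, and the clamping with pmax/pmin is left out because this example has no leading or trailing NA:

```r
dat <- c(1, 3, NA, NA, 5, 7)

na.pos     <- which(is.na(dat))     # positions 3 and 4
non.na.pos <- which(!is.na(dat))    # positions 1, 2, 5 and 6

# For each NA position, the index into non.na.pos of the closest
# non-NA position to its left.
intervals <- findInterval(na.pos, non.na.pos, all.inside = TRUE)

left.pos  <- non.na.pos[intervals]      # left neighbour of each NA
right.pos <- non.na.pos[intervals + 1]  # right neighbour of each NA

left.dist  <- na.pos - left.pos     # distance to the left neighbour
right.dist <- right.pos - na.pos    # distance to the right neighbour

# Take the closer neighbour; a tie would go to the left one.
filled <- dat
filled[na.pos] <- ifelse(left.dist <= right.dist,
                         dat[left.pos], dat[right.pos])
filled
# 1 3 3 5 5 7
```

Both NAs fall into the same interval (between positions 2 and 5), so they share the same two candidates and only the distances decide which value is used.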
And here I test it:
# sample data, suggested by @JeffAllen
dat <- as.integer(runif(50000, min=0, max=10))
dat[dat==0] <- NA
# computation times
system.time(r0 <- f0(dat)) # your function
# user system elapsed
# 5.52 0.00 5.52
system.time(r1 <- f1(dat)) # this function
# user system elapsed
# 0.01 0.00 0.03
identical(r0, r1)
# [1] TRUE
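The random benchmark data is unlikely to exercise the boundary cases, so here is a rough, self-contained sketch for checking them. The helper fill_nearest is a hypothetical condensed version of the findInterval idea, written only for this check; it assumes the right-hand index should be clamped to the number of non-NA positions:

```r
# Hypothetical condensed helper, for illustration only.
fill_nearest <- function(x) {
  na <- which(is.na(x))
  ok <- which(!is.na(x))
  # nothing to fill, or nothing to fill from
  if (length(na) == 0 || length(ok) == 0) return(x)
  i <- findInterval(na, ok, all.inside = TRUE)
  left  <- ok[pmax(1, i)]
  right <- ok[pmin(length(ok), i + 1)]
  # the closer neighbour wins; ties go to the left
  x[na] <- ifelse(na - left <= right - na, x[left], x[right])
  x
}

fill_nearest(c(1, NA, 3))      # tie between 1 and 3
# 1 1 3
fill_nearest(c(NA, NA, 2, 4))  # leading NAs
# 2 2 2 4
fill_nearest(c(2, 4, NA, NA))  # trailing NAs
# 2 4 4 4
fill_nearest(c(NA, 7, NA))     # a single non-NA value
# 7 7 7
```

If you would rather have ties resolved to the right, changing <= to < in the ifelse condition should be enough.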