Replacing NAs in R with nearest value

geoffjentry · Apr 9, 2012

I'm looking for something similar to na.locf() in the zoo package, but instead of always using the previous non-NA value I'd like to use the nearest non-NA value. Some example data:

dat <- c(1, 3, NA, NA, 5, 7)

Replacing NA with na.locf (3 is carried forward):

library(zoo)
na.locf(dat)
# 1 3 3 3 5 7

and na.locf with fromLast set to TRUE (5 is carried backwards):

na.locf(dat, fromLast = TRUE)
# 1 3 5 5 5 7

But I want the nearest non-NA value to be used. In my example this means that the 3 should be carried forward to the first NA, and the 5 should be carried backwards to the second NA:

1 3 3 5 5 7

I have a solution coded up, but wanted to make sure that I wasn't reinventing the wheel. Is there something already floating around?

FYI, my current code is as follows. Perhaps if nothing else, someone can suggest how to make it more efficient. I feel like I'm missing an obvious way to improve this:

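f0 <- function(dat) {
  na.pos <- which(is.na(dat))
  # nothing to fill (no NAs) or nothing to fill from (all NAs)
  if (length(na.pos) %in% c(0, length(dat))) {
    return(dat)
  }
  non.na.pos <- setdiff(seq_along(dat), na.pos)
  # for each NA, the index into non.na.pos of the closest non-NA position;
  # which.min returns the first minimum, so ties go to the left neighbour
  nearest.non.na.pos <- sapply(na.pos, function(x) {
    return(which.min(abs(non.na.pos - x)))
  })
  dat[na.pos] <- dat[non.na.pos[nearest.non.na.pos]]
  return(dat)
}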

To answer smci's questions below:

  1. No, any entry can be NA
  2. If all are NA, leave them as is
  3. No. My current solution defaults to the left-hand nearest value, but it doesn't matter
  4. These rows typically have a few hundred thousand elements, so in theory the upper bound would be a few hundred thousand. In reality it'd be no more than a few here and there, typically a single one.

Update: It turns out that we're going in a different direction altogether, but this was still an interesting discussion. Thanks, all!

Answer

flodel · Apr 10, 2012

Here is a very fast one. It uses findInterval to find, for each NA in your original data, the two non-NA positions that bracket it (one on each side), and then picks the closer of the two:

f1 <- function(dat) {
  N <- length(dat)
  na.pos <- which(is.na(dat))
  if (length(na.pos) %in% c(0, N)) {
    return(dat)
  }
  non.na.pos <- which(!is.na(dat))
  intervals  <- findInterval(na.pos, non.na.pos,
                             all.inside = TRUE)
  left.pos   <- non.na.pos[pmax(1, intervals)]
  # clamp to the length of non.na.pos (not N), since this indexes non.na.pos
  right.pos  <- non.na.pos[pmin(length(non.na.pos), intervals+1)]
  left.dist  <- na.pos - left.pos
  right.dist <- right.pos - na.pos

  dat[na.pos] <- ifelse(left.dist <= right.dist,
                        dat[left.pos], dat[right.pos])
  return(dat)
}
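
To make the bracketing step concrete, here is a small walk-through on the example vector from the question. It is only a sketch of the intermediate values, using the same variable names as the function above plus a scratch variable `filled` that I introduce for illustration:

```r
# Walk through the bracketing step on the question's example vector.
dat <- c(1, 3, NA, NA, 5, 7)

na.pos     <- which(is.na(dat))     # 3 4
non.na.pos <- which(!is.na(dat))    # 1 2 5 6

# For each NA position, the index (into non.na.pos) of the last
# non-NA position at or before it.
intervals <- findInterval(na.pos, non.na.pos, all.inside = TRUE)   # 2 2

left.pos  <- non.na.pos[intervals]        # 2 2  (the 3 sits at position 2)
right.pos <- non.na.pos[intervals + 1]    # 5 5  (the 5 sits at position 5)

left.dist  <- na.pos - left.pos     # 1 2
right.dist <- right.pos - na.pos    # 2 1

# Take the closer neighbour; "<=" sends ties to the left.
filled <- dat
filled[na.pos] <- ifelse(left.dist <= right.dist,
                         dat[left.pos], dat[right.pos])
filled
# 1 3 3 5 5 7
```

The first NA is one step from the 3 and two steps from the 5, so it takes the 3; the second NA is the mirror image and takes the 5.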

And here I test it:

# sample data, suggested by @JeffAllen
dat <- as.integer(runif(50000, min=0, max=10))
dat[dat==0] <- NA

# computation times
system.time(r0 <- f0(dat))    # the code from the question, wrapped in a function f0
# user  system elapsed 
# 5.52    0.00    5.52
system.time(r1 <- f1(dat))    # this function
# user  system elapsed 
# 0.01    0.00    0.03
identical(r0, r1)
# [1] TRUE
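
The random data above will rarely exercise the corner cases mentioned in the question (ties, leading or trailing NAs, a single non-NA value, all NA), so it can be worth checking those by hand. As a hedged cross-check, below is a self-contained sketch of a variant, not the function above: it calls findInterval on the midpoints between consecutive non-NA positions, so each NA falls directly into the "territory" of its nearest neighbour. The name `fill_nearest` is hypothetical, and the sketch assumes an R version whose findInterval supports the `left.open` argument (older versions may lack it).

```r
# Sketch: nearest-value fill via midpoints between non-NA positions.
# fill_nearest is a hypothetical helper name, not from any package.
fill_nearest <- function(dat) {
  na.pos     <- which(is.na(dat))
  non.na.pos <- which(!is.na(dat))

  # Nothing to fill, or nothing to fill from: return unchanged.
  if (length(na.pos) == 0 || length(non.na.pos) == 0) {
    return(dat)
  }
  # Only one candidate value: every NA takes it.
  if (length(non.na.pos) == 1) {
    dat[na.pos] <- dat[non.na.pos]
    return(dat)
  }

  # Midpoints between consecutive non-NA positions.
  mids <- (head(non.na.pos, -1) + tail(non.na.pos, -1)) / 2

  # Number of midpoints strictly to the left of each NA position.
  # left.open = TRUE makes a position exactly on a midpoint count as
  # belonging to the left neighbour, matching the tie rule in the question.
  idx <- findInterval(na.pos, mids, left.open = TRUE) + 1

  dat[na.pos] <- dat[non.na.pos[idx]]
  dat
}

fill_nearest(c(1, 3, NA, NA, 5, 7))   # 1 3 3 5 5 7
fill_nearest(c(1, NA, 2))             # 1 1 2   (tie goes left)
fill_nearest(c(NA, NA, 4, 5, NA))     # 4 4 4 5 5
fill_nearest(c(NA, 5, NA))            # 5 5 5
```

Comparing its output with `f1` on the same inputs, for example with `identical()`, is a cheap way to confirm that both agree on these corner cases.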