Replacing NAs in R with nearest value

Question 1

r na missing-data

geoffjentry · Apr 9, 2012 · Viewed 8.1k times · Source

Answer

Here is a very fast one. It uses findInterval to find, for each NA in your original data, the two neighboring non-NA positions that should be considered:

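f1 <- function(dat) {
  N <- length(dat)
  na.pos <- which(is.na(dat))
  # nothing to do if there are no NAs, or nothing to fill from if all are NA
  if (length(na.pos) %in% c(0, N)) {
    return(dat)
  }
  non.na.pos <- which(!is.na(dat))
  # for each NA, the index of the interval of non-NA positions it falls in;
  # all.inside = TRUE keeps the index between 1 and length(non.na.pos) - 1,
  # so NAs before the first or after the last non-NA value still get
  # two candidate positions
  intervals  <- findInterval(na.pos, non.na.pos,
                             all.inside = TRUE)
  # the two candidate positions and their distances to each NA
  left.pos   <- non.na.pos[pmax(1, intervals)]
  right.pos  <- non.na.pos[pmin(N, intervals+1)]
  left.dist  <- na.pos - left.pos
  right.dist <- right.pos - na.pos

  # take the closer candidate; ties go to the left one
  dat[na.pos] <- ifelse(left.dist <= right.dist,
                        dat[left.pos], dat[right.pos])
  return(dat)
}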
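
To see what the individual steps compute, here is a trace on the small example vector from the question. This is only an illustrative sketch: the name filled is introduced here for the result, and the pmax/pmin clamping is left out because it is not needed for this input.

```r
dat <- c(1, 3, NA, NA, 5, 7)

na.pos     <- which(is.na(dat))    # 3 4
non.na.pos <- which(!is.na(dat))   # 1 2 5 6

# both NAs lie between non.na.pos[2] = 2 and non.na.pos[3] = 5
intervals <- findInterval(na.pos, non.na.pos, all.inside = TRUE)  # 2 2

left.pos   <- non.na.pos[intervals]       # 2 2
right.pos  <- non.na.pos[intervals + 1]   # 5 5
left.dist  <- na.pos - left.pos           # 1 2
right.dist <- right.pos - na.pos          # 2 1

# position 3 is closer to the left neighbor, position 4 to the right one
filled <- dat
filled[na.pos] <- ifelse(left.dist <= right.dist,
                         dat[left.pos], dat[right.pos])
filled
# [1] 1 3 3 5 5 7
```

Because everything is vectorized over na.pos, there is no per-NA search over all non-NA positions, which is where the speed-up comes from.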

And here I test it:

# sample data, suggested by @JeffAllen
dat <- as.integer(runif(50000, min=0, max=10))
dat[dat==0] <- NA

# computation times
system.time(r0 <- f0(dat))    # your function
# user  system elapsed 
# 5.52    0.00    5.52
system.time(r1 <- f1(dat))    # this function
# user  system elapsed 
# 0.01    0.00    0.03
identical(r0, r1)
# [1] TRUE
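
If you want to rerun this, note that the sample data is random, so values and timings will differ between runs. A minimal sketch of the same data generation with a fixed seed (the seed value 42 is my arbitrary choice, not part of the test above) shows roughly what goes into the benchmark:

```r
set.seed(42)  # arbitrary seed, only here for reproducibility

# as.integer truncates toward zero, so the values are whole numbers in 0..9
dat <- as.integer(runif(50000, min = 0, max = 10))

# every 0 becomes NA, so about one value in ten is missing
dat[dat == 0] <- NA

length(dat)        # 50000
mean(is.na(dat))   # close to 0.1
```

So the NAs are scattered fairly densely through the vector, mostly as isolated values or short runs.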

Question 2

I'm looking for something similar to na.locf() in the zoo package, but instead of always using the previous non-NA value I'd like to use the nearest non-NA value. Some example data:

dat <- c(1, 3, NA, NA, 5, 7)

Replacing NA with na.locf (3 is carried forward):

library(zoo)
na.locf(dat)
# 1 3 3 3 5 7

and na.locf with fromLast set to TRUE (5 is carried backwards):

na.locf(dat, fromLast = TRUE)
# 1 3 5 5 5 7

But I want the nearest non-NA value to be used. In my example this means that the 3 should be carried forward to the first NA, and the 5 should be carried backwards to the second NA:

1 3 3 5 5 7

I have a solution coded up, but wanted to make sure that I wasn't reinventing the wheel. Is there something already floating around?

FYI, my current code is as follows. Perhaps if nothing else, someone can suggest how to make it more efficient. I feel like I'm missing an obvious way to improve this:

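f0 <- function(dat) {
  na.pos <- which(is.na(dat))
  # return early if there is nothing to fill (no NAs) or nothing to
  # fill from (all NAs); with no NAs the sapply below would return a list
  if (length(na.pos) %in% c(0, length(dat))) {
    return(dat)
  }
  non.na.pos <- setdiff(seq_along(dat), na.pos)
  # for each NA, the index into non.na.pos of the closest non-NA position
  nearest.non.na.pos <- sapply(na.pos, function(x) {
    return(which.min(abs(non.na.pos - x)))
  })
  dat[na.pos] <- dat[non.na.pos[nearest.non.na.pos]]
  return(dat)
}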

To answer smci's questions below:

1. No, any entry can be NA.
2. If all are NA, leave them as is.
3. No. My current solution defaults to the left-hand nearest value, but it doesn't matter.
4. These rows are typically a few hundred thousand elements long, so in theory the upper bound would be a few hundred thousand. In reality it'd be no more than a few here and there, typically a single one.
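
The left-hand default mentioned above comes from which.min, which returns the index of the first minimum when several distances are equal. A small sketch of that tie case, using made-up values (10 and 20 are only for illustration):

```r
dat <- c(10, NA, 20)

na.pos     <- which(is.na(dat))                   # 2
non.na.pos <- setdiff(seq_along(dat), na.pos)     # 1 3

# both neighbors are one step away, so the distances tie
dists   <- abs(non.na.pos - na.pos)               # 1 1
nearest <- which.min(dists)                       # 1, the first minimum

dat[na.pos] <- dat[non.na.pos[nearest]]
dat
# [1] 10 10 20
```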

Update: It turns out that we're going in a different direction altogether, but this was still an interesting discussion. Thanks, all!
