Several SQL languages (I mostly use postgreSQL) have a function called coalesce which returns the first non null column element for each row. This can be very efficient to use when tables have a lot of NULL
elements in them.
I encounter this in a lot of scenarios in R as well when dealing with not so structured data which has a lot of NA's in them.
I have made a naive implementation myself but it is ridiculously slow.
coalesce <- function(...) {
apply(cbind(...), 1, function(x) {
x[which(!is.na(x))[1]]
})
}
a <- c(1, 2, NA, 4, NA)
b <- c(NA, NA, NA, 5, 6)
c <- c(7, 8, NA, 9, 10)
coalesce(a,b,c)
# [1] 1 2 NA 4 6
Is there any efficient way to implement coalesce
in R?
On my machine, using Reduce
gets a 5x performance improvement:
coalesce2 <- function(...) {
Reduce(function(x, y) {
i <- which(is.na(x))
x[i] <- y[i]
x},
list(...))
}
> microbenchmark(coalesce(a,b,c),coalesce2(a,b,c))
Unit: microseconds
expr min lq median uq max neval
coalesce(a, b, c) 97.669 100.7950 102.0120 103.0505 243.438 100
coalesce2(a, b, c) 19.601 21.4055 22.8835 23.8315 45.419 100