R Left Outer Join with 0 Fill Instead of NA While Preserving Valid NA's in Left Table

Mekki MacAulay picture Mekki MacAulay · Feb 3, 2016 · Viewed 10.5k times · Source

What is the easiest way to do a left outer join on two data tables (dt1, dt2) with the fill value being 0 (or some other value) instead of NA (default) without overwriting valid NA values in the left data table?

A common answer, such as in this thread is to do the left outer join with either dplyr::left_join or data.table::merge or data.table's dt2[dt1] keyed column bracket syntax, followed by a second step simply replacing all NA values by 0 in the joined data table. For example:

library(data.table);
dt1 <- data.table(x=c('a', 'b', 'c', 'd', 'e'), y=c(NA, 'w', NA, 'y', 'z'));
dt2 <- data.table(x=c('a', 'b', 'c'), new_col=c(1,2,3));
setkey(dt1, x);
setkey(dt2, x);
merged_tables <- dt2[dt1];
merged_tables[is.na(merged_tables)] <- 0;

This approach necessarily assumes that there are no valid NA values in dt1 that need to be preserved. Yet, as you can see in the above example, the results are:

   x new_col y
1: a       1 0
2: b       2 w
3: c       3 0
4: d       0 y
5: e       0 z

but the desired results are:

   x new_col y
1: a       1 NA
2: b       2 w
3: c       3 NA
4: d       0 y
5: e       0 z

In such a trivial case, instead of using the data.table all elements replace syntax as above, just the NA values in new_col could be replaced:

library(dplyr);
merged_tables <- mutate(merged_tables, new_col = ifelse(is.na(new_col), 0, new_col));

However, this approach is not practical for very large data sets where dozens or hundreds of new columns are merged, sometimes with dynamically created column names. Even if the column names were all known ahead of time, it's very ugly to list out all the new columns and do a mutate-style replace on each one.

There must be a better way? The issue would be simply resolved if the syntax of any of dplyr::left_join, data.table::merge, or data.table's bracket easily allowed the user to specify a fill value other than NA. Something like:

merged_tables <- data.table::merge(dt1, dt2, by="x", all.x=TRUE, fill=0);

data.table's dcast function allows the user to specify fill value, so I figure there must be an easier way to do this that I'm just not thinking of.

Suggestions?

EDIT: @jangorecki pointed out in the comments that there is a feature request currently open on the data.table GitHug page to do exactly what I just mentioned, updating the nomatch=0 syntax. Should be in the next release of data.table.

Answer

Fernando Macedo picture Fernando Macedo · Jan 10, 2017

I stumbled on the same problem with dplyr and wrote a small function that solved my problem. (the solution requires tidyr and dplyr)

left_join0 <- function(x, y, fill = 0L){
  z <- left_join(x, y)
  tmp <- setdiff(names(z), names(x))
  z <- replace_na(z, setNames(as.list(rep(fill, length(tmp))), tmp))
  z
}