What is the easiest way to do a left outer join on two data tables (dt1, dt2) with the fill value being 0 (or some other value) instead of NA (default) without overwriting valid NA values in the left data table?
A common answer, such as in this thread is to do the left outer join with either dplyr::left_join
or data.table::merge
or data.table
's dt2[dt1] keyed column bracket syntax, followed by a second step simply replacing all NA
values by 0
in the joined data table. For example:
library(data.table);
dt1 <- data.table(x=c('a', 'b', 'c', 'd', 'e'), y=c(NA, 'w', NA, 'y', 'z'));
dt2 <- data.table(x=c('a', 'b', 'c'), new_col=c(1,2,3));
setkey(dt1, x);
setkey(dt2, x);
merged_tables <- dt2[dt1];
merged_tables[is.na(merged_tables)] <- 0;
This approach necessarily assumes that there are no valid NA values in dt1
that need to be preserved. Yet, as you can see in the above example, the results are:
x new_col y
1: a 1 0
2: b 2 w
3: c 3 0
4: d 0 y
5: e 0 z
but the desired results are:
x new_col y
1: a 1 NA
2: b 2 w
3: c 3 NA
4: d 0 y
5: e 0 z
In such a trivial case, instead of using the data.table
all elements replace syntax as above, just the NA values in new_col
could be replaced:
library(dplyr);
merged_tables <- mutate(merged_tables, new_col = ifelse(is.na(new_col), 0, new_col));
However, this approach is not practical for very large data sets where dozens or hundreds of new columns are merged, sometimes with dynamically created column names. Even if the column names were all known ahead of time, it's very ugly to list out all the new columns and do a mutate-style replace on each one.
There must be a better way? The issue would be simply resolved if the syntax of any of dplyr::left_join
, data.table::merge
, or data.table
's bracket easily allowed the user to specify a fill
value other than NA. Something like:
merged_tables <- data.table::merge(dt1, dt2, by="x", all.x=TRUE, fill=0);
data.table
's dcast
function allows the user to specify fill
value, so I figure there must be an easier way to do this that I'm just not thinking of.
Suggestions?
EDIT: @jangorecki pointed out in the comments that there is a feature request currently open on the data.table
GitHug page to do exactly what I just mentioned, updating the nomatch=0
syntax. Should be in the next release of data.table
.
I stumbled on the same problem with dplyr and wrote a small function that solved my problem. (the solution requires tidyr and dplyr)
left_join0 <- function(x, y, fill = 0L){
z <- left_join(x, y)
tmp <- setdiff(names(z), names(x))
z <- replace_na(z, setNames(as.list(rep(fill, length(tmp))), tmp))
z
}