merge.data.table with all=TRUE introduces NA row. Is this correct?

Question 1

merge.data.table with all=TRUE introduces NA row. Is this correct?

r data.table outer-join

vsalmendra · Mar 22, 2013 · Viewed 10.7k times

Answer


The example in the question is too simple to show the problem, which is what caused the confusion and discussion. Two one-column data.tables aren't enough to show what merge does.

Here's a better example:

> a = data.table(P=1:2,Q=3:4,key='P')
> b = data.table(P=2:3,R=5:6,key='P')
> a
   P Q
1: 1 3
2: 2 4
> b
   P R
1: 2 5
2: 3 6
> merge(a,b)  # correct
   P Q R
1: 2 4 5
> merge(a,b,all=TRUE)  # correct
   P  Q  R
1: 1  3 NA
2: 2  4  5
3: 3 NA  6
> merge(a,b[0],all=TRUE)  # incorrect result when y is empty, agreed
    P  Q  R
1: NA NA NA
2: NA NA NA
3:  1  3 NA
4:  2  4 NA
> merge.data.frame(a,b[0],all=TRUE)  # correct
  P Q  R
1 1 3 NA
2 2 4 NA

Ricardo got to the bottom of this and fixed it in v1.8.9. From NEWS:

merge no longer returns spurious NA row(s) when y is empty and all.y=TRUE (or all=TRUE), #2633. Thanks to Vinicius Almendra for reporting. Test added.
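To confirm that an installation includes the fix, something like the following sketch could be used. It reuses the a and b tables from the example above; the variable names fixed and res are only illustrative, and the check assumes the installed version behaves as the NEWS entry describes.

```r
library(data.table)

# The NEWS entry reports the fix for v1.8.9, so check the installed version first.
fixed <- packageVersion("data.table") >= "1.8.9"

a <- data.table(P = 1:2, Q = 3:4, key = "P")
b <- data.table(P = 2:3, R = 5:6, key = "P")

# Outer merge against an empty y (b[0] keeps the columns but has zero rows).
res <- merge(a, b[0], all = TRUE)

# On a fixed version only the rows of `a` should come back:
# no NA keys, and R filled with NA because y contributed nothing.
if (fixed) {
  stopifnot(nrow(res) == nrow(a),
            !any(is.na(res$P)),
            all(is.na(res$R)))
}
```

If the assertion fails, the installed version most likely predates the fix.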

Question 2

Merging a populated data.table with an empty one introduces an NA row in the resulting data.table:

a = data.table(c=c(1,2),key='c')
b = data.table(c=3,key='c')
b=b[c!=3]
b
# Empty data.table (0 rows) of 1 col: c
merge(a,b,all=T)
#     c
# 1: NA
# 2:  1
# 3:  2

Why? I expected it to return only the rows of data.table a, as merge.data.frame does:

merge.data.frame(a,b,all=T,by='c')
#  c
#1 1
#2 2
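One way to check the expectation programmatically is to compare the key column returned by the two methods. This is a minimal sketch using the same a and b as above; dt_res and df_res are illustrative names, and the inputs are converted with as.data.frame so that the reference result comes from plain data.frame merging.

```r
library(data.table)

a <- data.table(c = c(1, 2), key = "c")
b <- data.table(c = 3, key = "c")
b <- b[c != 3]                      # empty data.table with one column

# Result from the data.table method.
dt_res <- merge(a, b, all = TRUE)

# Reference result from the data.frame method.
df_res <- merge(as.data.frame(a), as.data.frame(b), all = TRUE, by = "c")

# TRUE when both methods return the same key values (no spurious NA row).
same_keys <- identical(dt_res$c, df_res$c)
print(same_keys)
```

A FALSE here would indicate that merge.data.table still adds the extra NA row described in the question.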
