How can I perform full outer joins of large data sets in R?

Drew75 picture Drew75 · Jun 6, 2013 · Viewed 8.5k times · Source

I am trying to do data analysis in R on a group of medium sized datasets. One of the analyses I need to do requires me to do a full outer join amongst around 24-48 files, each of with has about 60 columns and up to 450,000 lines. So I've been running into memory issues a lot.

I thought at ffbase or sqldf would help, but apparently full outer join is not possible with either of them.

Is there a workaround? A package I haven't found yet?

Answer

joran picture joran · Jun 8, 2013

Here is a simple example that illustrates how to do outer joins of several datasets:

library(sqldf)
dat1 <- data.frame(x = 1:5,y = letters[1:5])
dat2 <- data.frame(w = 3:8,z = letters[3:8])
> 
> sqldf("select * from dat1 left outer join dat2 on dat1.x = dat2.w UNION 
+       select * from dat2 left outer join dat1 on dat1.x = dat2.w")
  x y  w    z
1 1 a NA <NA>
2 2 b NA <NA>
3 3 c  3    c
4 4 d  4    d
5 5 e  5    e
6 6 f NA <NA>
7 7 g NA <NA>
8 8 h NA <NA>

There it is, a full outer join using sqldf and SQLite as a backend.

As I also mentioned, sqldf support more back ends than SQLite. A single Google search reveals that full outer joins are accomplished the exact same way in MySQL. I am less familiar with postgres but this question sure suggests that full outer joins are possible there as well.