I have two data frames, df1 and df2, that each have around 10 million rows and 4 columns. I read them into R using RODBC/sqlQuery with no problems, but when I try to rbind them, I get that most dreaded of R error messages: cannot allocate memory. There has got to be a more efficient way to do an rbind -- anyone have their favorite tricks on this they want to share? For instance, I found this example in the doc for sqldf:
# rbind
a7r <- rbind(a5r, a6r)
a7s <- sqldf("select * from a5s union all select * from a6s")
Is that the best/recommended way to do it?
UPDATE
I got it to work using the crucial dbname = tempfile() argument in the sqldf call above, as JD Long suggests in his answer to this question.
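For reference, a minimal sketch of what that working call looks like, assuming the same a5s and a6s tables from the sqldf example above:

# dbname = tempfile() builds the intermediate SQLite database in a
# temporary file on disk instead of in memory, avoiding the allocation error
a7s <- sqldf("select * from a5s union all select * from a6s",
             dbname = tempfile())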
Rather than reading them into R at the beginning and then combining them you could have SQLite read them and combine them before sending them to R. That way the files are never individually loaded into R.
# create two sample files
DF1 <- data.frame(A = 1:2, B = 2:3)
write.table(DF1, "data1.dat", sep = ",", quote = FALSE)
rm(DF1)
DF2 <- data.frame(A = 10:11, B = 12:13)
write.table(DF2, "data2.dat", sep = ",", quote = FALSE)
rm(DF2)
# now we do the real work
library(sqldf)
data1 <- file("data1.dat")
data2 <- file("data2.dat")
sqldf(c("select * from data1",
"insert into data1 select * from data2",
"select * from data1"),
dbname = tempfile())
This gives:
> sqldf(c("select * from data1", "insert into data1 select * from data2", "select * from data1"), dbname = tempfile())
A B
1 1 2
2 2 3
3 10 12
4 11 13
This shorter version also works if row order is unimportant (note that union, unlike union all, also removes duplicate rows):
sqldf("select * from data1 union select * from data2", dbname = tempfile())
See the sqldf home page http://sqldf.googlecode.com and ?sqldf for more info. Pay particular attention to the file format arguments, since they are close but not identical to read.table. Here we have used the defaults, so it was less of an issue.
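For example, if the input files had been headerless and tab-separated instead, something like this sketch should work (the file.format argument and its header/sep components are taken from ?sqldf; verify against your installed version):

# hypothetical variation: headerless, tab-separated input files
sqldf(c("select * from data1",
        "insert into data1 select * from data2",
        "select * from data1"),
      dbname = tempfile(),
      file.format = list(header = FALSE, sep = "\t"))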