I have a list of many data.frames that I want to merge. The issue here is that each data.frame differs in terms of the number of rows and columns, but they all share the key variables (which I've called "var1"
and "var2"
in the code below). If the data.frames were identical in terms of columns, I could merely rbind
, for which plyr's rbind.fill would do the job, but that's not the case with these data.
Because the merge
command only works on 2 data.frames, I turned to the Internet for ideas. I got this one from here, which worked perfectly in R 2.7.2, which is what I had at the time:
merge.rec <- function(.list, ...){
if(length(.list)==1) return(.list[[1]])
Recall(c(list(merge(.list[[1]], .list[[2]], ...)), .list[-(1:2)]), ...)
}
And I would call the function like so:
df <- merge.rec(my.list, by.x = c("var1", "var2"),
by.y = c("var1", "var2"), all = T, suffixes=c("", ""))
But in any R version after 2.7.2, including 2.11 and 2.12, this code fails with the following error:
Error in match.names(clabs, names(xi)) :
names do not match previous names
(Incidently, I see other references to this error elsewhere with no resolution).
Is there any way to solve this?
Reduce makes this fairly easy:
merged.data.frame = Reduce(function(...) merge(..., all=T), list.of.data.frames)
Here's a fully example using some mock data:
set.seed(1)
list.of.data.frames = list(data.frame(x=1:10, a=1:10), data.frame(x=5:14, b=11:20), data.frame(x=sample(20, 10), y=runif(10)))
merged.data.frame = Reduce(function(...) merge(..., all=T), list.of.data.frames)
tail(merged.data.frame)
# x a b y
#12 12 NA 18 NA
#13 13 NA 19 NA
#14 14 NA 20 0.4976992
#15 15 NA NA 0.7176185
#16 16 NA NA 0.3841037
#17 19 NA NA 0.3800352
And here's an example using these data to replicate my.list
:
merged.data.frame = Reduce(function(...) merge(..., by=match.by, all=T), my.list)
merged.data.frame[, 1:12]
# matchname party st district chamber senate1993 name.x v2.x v3.x v4.x senate1994 name.y
#1 ALGIERE 200 RI 026 S NA <NA> NA NA NA NA <NA>
#2 ALVES 100 RI 019 S NA <NA> NA NA NA NA <NA>
#3 BADEAU 100 RI 032 S NA <NA> NA NA NA NA <NA>
Note: It looks like this is arguably a bug in merge
. The problem is there is no check that adding the suffixes (to handle overlapping non-matching names) actually makes them unique. At a certain point it uses [.data.frame
which does make.unique
the names, causing the rbind
to fail.
# first merge will end up with 'name.x' & 'name.y'
merge(my.list[[1]], my.list[[2]], by=match.by, all=T)
# [1] matchname party st district chamber senate1993 name.x
# [8] votes.year.x senate1994 name.y votes.year.y
#<0 rows> (or 0-length row.names)
# as there is no clash, we retain 'name.x' & 'name.y' and get 'name' again
merge(merge(my.list[[1]], my.list[[2]], by=match.by, all=T), my.list[[3]], by=match.by, all=T)
# [1] matchname party st district chamber senate1993 name.x
# [8] votes.year.x senate1994 name.y votes.year.y senate1995 name votes.year
#<0 rows> (or 0-length row.names)
# the next merge will fail as 'name' will get renamed to a pre-existing field.
Easiest way to fix is to not leave the field renaming for duplicates fields (of which there are many here) up to merge
. Eg:
my.list2 = Map(function(x, i) setNames(x, ifelse(names(x) %in% match.by,
names(x), sprintf('%s.%d', names(x), i))), my.list, seq_along(my.list))
The merge
/Reduce
will then work fine.