My question involves writing code using the dplyr package in R.
I have a relatively large dataframe (approx. 5 million rows) with 2 columns: the first with an individual identifier (id), and a second with a date (date). At present, each row indicates the occurrence of an action (taken by the individual in the id column) on the date in the date column. There are about 300,000 unique individuals, and about 2600 unique dates. For example, the beginning of the data look like this:
id date
John12 2006-08-03
Tom2993 2008-10-11
Lisa825 2009-07-03
Tom2993 2008-06-12
Andrew13 2007-09-11
I'd like to reshape the data so that I have a row for every possible id x date pair, with an additional column which counts the total number of events that occurred (perhaps taking the value 0) for the listed individual on the given date.
I've had some success with the dplyr package, which I've used to tabulate the id x date counts which are observed in the data.
Here's the code I've used to tabulate id x date counts so far (my dataframe is called df):
library(dplyr)
# %>% replaces the older %.% operator, which current dplyr no longer provides
reduced = df %>%
  group_by(id, date) %>%
  summarize(count = n())
My problem is that (as I said above) I'd like to have a dataset that also includes 0s for id x date pairs that don't have any associated actions. For example, if there's no observed action for John12 on 2007-10-10, I'd like the output to return a row for that id x date pair, with a count of 0.
I considered creating the frame above, then merging it with an empty frame, but I'm convinced there must be a simpler solution. Any suggestions are much appreciated!
Here's a simple option, using data.table instead:
library(data.table)
dt = as.data.table(your_df)
setkey(dt, id, date)
# in versions 1.9.3+
dt[CJ(unique(id), unique(date)), .N, by = .EACHI]
#           id       date N
#  1: Andrew13 2006-08-03 0
#  2: Andrew13 2007-09-11 1
#  3: Andrew13 2008-06-12 0
#  4: Andrew13 2008-10-11 0
#  5: Andrew13 2009-07-03 0
#  6:   John12 2006-08-03 1
#  7:   John12 2007-09-11 0
#  8:   John12 2008-06-12 0
#  9:   John12 2008-10-11 0
# 10:   John12 2009-07-03 0
# 11:  Lisa825 2006-08-03 0
# 12:  Lisa825 2007-09-11 0
# 13:  Lisa825 2008-06-12 0
# 14:  Lisa825 2008-10-11 0
# 15:  Lisa825 2009-07-03 1
# 16:  Tom2993 2006-08-03 0
# 17:  Tom2993 2007-09-11 0
# 18:  Tom2993 2008-06-12 1
# 19:  Tom2993 2008-10-11 1
# 20:  Tom2993 2009-07-03 0
In versions 1.9.2 or earlier, the equivalent expression omits the explicit by:
dt[CJ(unique(id), unique(date)), .N]
The idea is to create all possible pairs of id and date (which is what the CJ part does), and then merge it back, counting occurrences.
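To make the two steps in that idea visible, here is a minimal sketch that spells them out separately instead of using the one-line join. It assumes a reasonably recent data.table (one that supports the on argument) and uses the five example rows from the question as toy data, with dates kept as character strings; the names grid, observed and result are purely illustrative.

```r
library(data.table)

# Toy data: the five example rows from the question (dates kept as strings)
dt <- data.table(
  id   = c("John12", "Tom2993", "Lisa825", "Tom2993", "Andrew13"),
  date = c("2006-08-03", "2008-10-11", "2009-07-03", "2008-06-12", "2007-09-11")
)

# Step 1: every possible id x date pair (CJ is data.table's cross join)
grid <- CJ(id = unique(dt$id), date = unique(dt$date))

# Step 2: counts for the pairs that actually occur in the data
observed <- dt[, .(N = .N), by = .(id, date)]

# Step 3: look up each grid row in the observed counts;
# pairs with no match come back with N = NA
result <- observed[grid, on = c("id", "date")]

# Step 4: pairs that never occurred get a count of 0
result[is.na(N), N := 0L]

print(result)
```

On the toy data this gives 4 ids x 5 dates = 20 rows. At the scale described in the question (about 300,000 ids and 2600 dates) the full grid would have roughly 780 million rows, so it is worth checking that the result fits in memory before building it.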