Using dplyr for frequency counts of interactions, must include zero counts

Mark T Patterson · May 21, 2014 · Viewed 9.7k times

My question involves writing code using the dplyr package in R.

I have a relatively large dataframe (approx 5 million rows) with 2 columns: the first with an individual identifier (id), and a second with a date (date). At present, each row indicates the occurrence of an action (taken by the individual in the id column) on the date in the date column. There are about 300,000 unique individuals, and about 2600 unique dates. For example, the beginning of the data look like this:

    id         date
    John12     2006-08-03
    Tom2993    2008-10-11
    Lisa825    2009-07-03
    Tom2993    2008-06-12
    Andrew13   2007-09-11

I'd like to reshape the data so that I have a row for every possible id x date pair, with an additional column which counts the total number of events that occurred (perhaps taking the value 0) for the listed individual on the given date.

I've had some success with the dplyr package, which I've used to tabulate the id x date counts which are observed in the data.

Here's the code I've used so far to tabulate id x date counts (my dataframe is called df):

library(dplyr)

# Count events per observed id x date pair (%>% supersedes the older %.% operator)
reduced = df %>%
  group_by(id, date) %>%
  summarize(n = n())

My problem is that (as I said above) I'd like to have a dataset that also includes 0s for id x date pairs that don't have any associated actions. For example, if there's no observed action for John12 on 2007-10-10, I'd like the output to return a row for that id x date pair, with a count of 0.

I considered creating the frame above, then merging with an empty frame, but I'm convinced there must be a simpler solution. Any suggestions much appreciated!
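For reference, the merge approach mentioned above could be sketched in base R roughly as follows. The toy data frame reuses the five sample rows from the question; the names `observed`, `grid`, and `full` are illustrative and do not come from the original post.

```r
# Toy data mirroring the sample rows in the question
df <- data.frame(
  id   = c("John12", "Tom2993", "Lisa825", "Tom2993", "Andrew13"),
  date = as.Date(c("2006-08-03", "2008-10-11", "2009-07-03",
                   "2008-06-12", "2007-09-11")),
  stringsAsFactors = FALSE
)

# Counts for the id x date pairs that actually occur in the data
observed <- aggregate(n ~ id + date, data = cbind(df, n = 1L), FUN = sum)

# Every possible id x date pair (hypothetical helper frame)
grid <- expand.grid(id = unique(df$id), date = unique(df$date),
                    stringsAsFactors = FALSE)

# Left join the grid to the observed counts; pairs with no events
# come back as NA and are then set to 0
full <- merge(grid, observed, by = c("id", "date"), all.x = TRUE)
full$n[is.na(full$n)] <- 0L
```

With the sizes quoted in the question, the full grid would hold roughly 300,000 x 2,600 = 780 million rows, so memory is likely to be the limiting factor for any approach that materializes every pair.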

Answer

eddi · May 21, 2014

Here's a simple option, using data.table instead:

library(data.table)

dt = as.data.table(your_df)

setkey(dt, id, date)

# in versions 1.9.3+
dt[CJ(unique(id), unique(date)), .N, by = .EACHI]
#          id       date N
# 1: Andrew13 2006-08-03 0
# 2: Andrew13 2007-09-11 1
# 3: Andrew13 2008-06-12 0
# 4: Andrew13 2008-10-11 0
# 5: Andrew13 2009-07-03 0
# 6:   John12 2006-08-03 1
# 7:   John12 2007-09-11 0
# 8:   John12 2008-06-12 0
# 9:   John12 2008-10-11 0
#10:   John12 2009-07-03 0
#11:  Lisa825 2006-08-03 0
#12:  Lisa825 2007-09-11 0
#13:  Lisa825 2008-06-12 0
#14:  Lisa825 2008-10-11 0
#15:  Lisa825 2009-07-03 1
#16:  Tom2993 2006-08-03 0
#17:  Tom2993 2007-09-11 0
#18:  Tom2993 2008-06-12 1
#19:  Tom2993 2008-10-11 1
#20:  Tom2993 2009-07-03 0

In versions 1.9.2 or earlier, the equivalent expression omits the explicit by:

dt[CJ(unique(id), unique(date)), .N]

The idea is to create all possible pairs of id and date (which is what the CJ part does), and then merge it back, counting occurrences.
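For readers who would rather stay within dplyr, the same idea (build every id x date pair, then fill in the missing counts) might be sketched as below. This is not part of the original answer: it assumes reasonably recent versions of dplyr and tidyr are installed, and it uses `tidyr::complete()` in place of `CJ`. The toy data and the name `counts` are illustrative.

```r
library(dplyr)
library(tidyr)

# Toy data mirroring the sample rows in the question
df <- data.frame(
  id   = c("John12", "Tom2993", "Lisa825", "Tom2993", "Andrew13"),
  date = as.Date(c("2006-08-03", "2008-10-11", "2009-07-03",
                   "2008-06-12", "2007-09-11")),
  stringsAsFactors = FALSE
)

counts <- df %>%
  count(id, date) %>%                       # observed id x date counts, stored in column n
  complete(id, date, fill = list(n = 0L))   # add the unobserved pairs with n = 0
```

On the toy data this yields one row per id x date pair (4 ids x 5 dates), with `n` equal to 0 wherever no event was observed.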