Group/bin/bucket data in R and get count per bucket and sum of values per bucket

Freewill picture Freewill · Jan 4, 2015 · Viewed 31.5k times · Source

I wish to bucket/group/bin data :

C1             C2       C3
49488.01172    0.0512   54000
268221.1563    0.0128   34399
34775.96094    0.0128   54444
13046.98047    0.07241  61000
2121699.75     0.00453  78921
71155.09375    0.0181   13794
1369809.875    0.00453  12312
750            0.2048   43451
44943.82813    0.0362   49871
85585.04688    0.0362   18947
31090.10938    0.0362   13401
68550.40625    0.0181   14345

I want to bucket it by C2 values but I wish to define the buckets e.g. <=0.005, <=.010, <=.014 etc. As you can see, the bucketing will be uneven intervals. I want the count of C1 per bucket as well as the total sum of C1 for every bucket.

I don't know where to begin as I am fairly new a user of R. Is there anyone willing to help me figure out the code or direct to me to an example that will work for my needs?

EDIT: added another column C3. I need sum of C3 per bucket as well at the same time as sum and count of C1 per bucket

Answer

akrun picture akrun · Jan 4, 2015

From the comments, "C2" seems to be "character" column with % as suffix. Before, creating a group, remove the % using sub, convert to "numeric" (as.numeric). The variable "group" is created (transform(df,...)) by using the function cut with breaks (group buckets/intervals) and labels (for the desired group labels) arguments. Once the group variable is created, the sum of the "C1" by "group" and the "count" of elements within "group" can be done using aggregate from "base R"

df1 <-  transform(df, group=cut(as.numeric(sub('[%]', '', C2)), 
    breaks=c(-Inf,0.005, 0.010, 0.014, Inf),
      labels=c('<0.005', 0.005, 0.01, 0.014)))

 res <- do.call(data.frame,aggregate(C1~group, df1, 
        FUN=function(x) c(Count=length(x), Sum=sum(x))))

 dNew <- data.frame(group=levels(df1$group))
 merge(res, dNew, all=TRUE)
 #   group C1.Count    C1.Sum
 #1 <0.005        2 3491509.6
 #2  0.005       NA        NA
 #3   0.01        2  302997.1
 #4  0.014        8  364609.5

or you can use data.table. setDT converts the data.frame to data.table. Specify the "grouping" variable with by= and summarize/create the two variables "Count" and "Sum" within the list(. .N gives the count of elements within each "group".

 library(data.table)
  setDT(df1)[, list(Count=.N, Sum=sum(C1)), by=group][]

Or using dplyr. The %>% connect the LHS with RHS arguments and chains them together. Use group_by to specify the "group" variable, and then use summarise_each or summarise to get summary count and sum of the concerned column. summarise_each would be useful if there are more than one column.

 library(dplyr)
 df1 %>%
      group_by(group) %>% 
      summarise_each(funs(n(), Sum=sum(.)), C1)

Update

Using the new dataset df

df1 <- transform(df, group=cut(C2,  breaks=c(-Inf,0.005, 0.010, 0.014, Inf),
                             labels=c('<0.005', 0.005, 0.01, 0.014)))

res <- do.call(data.frame,aggregate(cbind(C1,C3)~group, df1, 
       FUN=function(x) c(Count=length(x), Sum=sum(x))))
res
#  group C1.Count    C1.Sum C3.Count C3.Sum
#1 <0.005        2 3491509.6        2  91233
#2   0.01        2  302997.1        2  88843
#3  0.014        8  364609.5        8 268809

and you can do the merge as detailed above.

The dplyr approach would be the same except specifying the additional variable

 df1%>%
      group_by(group) %>%
       summarise_each(funs(n(), Sum=sum(.)), C1, C3)
 #Source: local data frame [3 x 5]

 #  group C1_n C3_n    C1_Sum C3_Sum
 #1 <0.005    2    2 3491509.6  91233
 #2   0.01    2    2  302997.1  88843
 #3  0.014    8    8  364609.5 268809

data

df <-structure(list(C1 = c(49488.01172, 268221.1563, 34775.96094, 
13046.98047, 2121699.75, 71155.09375, 1369809.875, 750, 44943.82813, 
85585.04688, 31090.10938, 68550.40625), C2 = c("0.0512%", "0.0128%", 
"0.0128%", "0.07241%", "0.00453%", "0.0181%", "0.00453%", "0.2048%", 
"0.0362%", "0.0362%", "0.0362%", "0.0181%")), .Names = c("C1", 
"C2"), row.names = c(NA, -12L), class = "data.frame")