I wish to bucket/group/bin data :
C1 C2 C3
49488.01172 0.0512 54000
268221.1563 0.0128 34399
34775.96094 0.0128 54444
13046.98047 0.07241 61000
2121699.75 0.00453 78921
71155.09375 0.0181 13794
1369809.875 0.00453 12312
750 0.2048 43451
44943.82813 0.0362 49871
85585.04688 0.0362 18947
31090.10938 0.0362 13401
68550.40625 0.0181 14345
I want to bucket it by C2 values but I wish to define the buckets e.g. <=0.005, <=.010, <=.014 etc. As you can see, the bucketing will be uneven intervals. I want the count of C1 per bucket as well as the total sum of C1 for every bucket.
I don't know where to begin as I am fairly new a user of R. Is there anyone willing to help me figure out the code or direct to me to an example that will work for my needs?
EDIT: added another column C3. I need sum of C3 per bucket as well at the same time as sum and count of C1 per bucket
From the comments, "C2" seems to be "character" column with %
as suffix. Before, creating a group, remove the %
using sub
, convert to "numeric" (as.numeric
). The variable "group" is created (transform(df,...)
) by using the function cut
with breaks
(group buckets/intervals) and labels
(for the desired group labels) arguments. Once the group variable is created, the sum
of the "C1" by "group" and the "count" of elements within "group" can be done using aggregate
from "base R"
df1 <- transform(df, group=cut(as.numeric(sub('[%]', '', C2)),
breaks=c(-Inf,0.005, 0.010, 0.014, Inf),
labels=c('<0.005', 0.005, 0.01, 0.014)))
res <- do.call(data.frame,aggregate(C1~group, df1,
FUN=function(x) c(Count=length(x), Sum=sum(x))))
dNew <- data.frame(group=levels(df1$group))
merge(res, dNew, all=TRUE)
# group C1.Count C1.Sum
#1 <0.005 2 3491509.6
#2 0.005 NA NA
#3 0.01 2 302997.1
#4 0.014 8 364609.5
or you can use data.table
. setDT
converts the data.frame
to data.table
. Specify the "grouping" variable with by=
and summarize/create the two variables "Count" and "Sum" within the list(
. .N
gives the count of elements within each "group".
library(data.table)
setDT(df1)[, list(Count=.N, Sum=sum(C1)), by=group][]
Or using dplyr
. The %>%
connect the LHS with RHS arguments and chains them together. Use group_by
to specify the "group" variable, and then use summarise_each
or summarise
to get summary count and sum
of the concerned column. summarise_each
would be useful if there are more than one column.
library(dplyr)
df1 %>%
group_by(group) %>%
summarise_each(funs(n(), Sum=sum(.)), C1)
Using the new dataset df
df1 <- transform(df, group=cut(C2, breaks=c(-Inf,0.005, 0.010, 0.014, Inf),
labels=c('<0.005', 0.005, 0.01, 0.014)))
res <- do.call(data.frame,aggregate(cbind(C1,C3)~group, df1,
FUN=function(x) c(Count=length(x), Sum=sum(x))))
res
# group C1.Count C1.Sum C3.Count C3.Sum
#1 <0.005 2 3491509.6 2 91233
#2 0.01 2 302997.1 2 88843
#3 0.014 8 364609.5 8 268809
and you can do the merge
as detailed above.
The dplyr
approach would be the same except specifying the additional variable
df1%>%
group_by(group) %>%
summarise_each(funs(n(), Sum=sum(.)), C1, C3)
#Source: local data frame [3 x 5]
# group C1_n C3_n C1_Sum C3_Sum
#1 <0.005 2 2 3491509.6 91233
#2 0.01 2 2 302997.1 88843
#3 0.014 8 8 364609.5 268809
df <-structure(list(C1 = c(49488.01172, 268221.1563, 34775.96094,
13046.98047, 2121699.75, 71155.09375, 1369809.875, 750, 44943.82813,
85585.04688, 31090.10938, 68550.40625), C2 = c("0.0512%", "0.0128%",
"0.0128%", "0.07241%", "0.00453%", "0.0181%", "0.00453%", "0.2048%",
"0.0362%", "0.0362%", "0.0362%", "0.0181%")), .Names = c("C1",
"C2"), row.names = c(NA, -12L), class = "data.frame")