Suppose I want to calculate the proportion of different values within each group. For example, using the mtcars
data, how do I calculate the relative frequency of number of gears by am (automatic/manual) in one go with dplyr
?
library(dplyr)
data(mtcars)
mtcars <- tbl_df(mtcars)
# count frequency
mtcars %>%
group_by(am, gear) %>%
summarise(n = n())
# am gear n
# 0 3 15
# 0 4 4
# 1 4 8
# 1 5 5
What I would like to achieve:
am gear n rel.freq
0 3 15 0.7894737
0 4 4 0.2105263
1 4 8 0.6153846
1 5 5 0.3846154
Try this:
mtcars %>%
group_by(am, gear) %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n))
# am gear n freq
# 1 0 3 15 0.7894737
# 2 0 4 4 0.2105263
# 3 1 4 8 0.6153846
# 4 1 5 5 0.3846154
From the dplyr vignette:
When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll-up a dataset.
Thus, after the summarise
, the last grouping variable specified in group_by
, 'gear', is peeled off. In the mutate
step, the data is grouped by the remaining grouping variable(s), here 'am'. You may check grouping in each step with groups
.
The outcome of the peeling is of course dependent of the order of the grouping variables in the group_by
call. You may wish to do a subsequent group_by(am)
, to make your code more explicit.
For rounding and prettification, please refer to the nice answer by @Tyler Rinker.