Summary stats by factor level for multiple variables

Rory Shaw · Nov 23, 2015 · Viewed 8.4k times

I want to produce dataframes containing summary statistics for each factor level for multiple variables.

For example, if I have the following data frame:

Factor <- c("A","A","A","B","B","B")
Variable1 <- c(3,4,5,4,5,3)
Variable2 <- c(7,9,14,16,10,10)
mydf <- data.frame(Factor, Variable1, Variable2)
mydf
  Factor Variable1 Variable2
1      A         3         7
2      A         4         9
3      A         5        14
4      B         4        16
5      B         5        10
6      B         3        10

and I have the following function that I want to use to produce my summary stats:

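# SEM() is not defined in base R; the standard error is assumed here to be sd / sqrt(n)
my.summary <- function(x, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]  # drop missing values once, so every statistic honours na.rm
  c(n = length(x), Mean = mean(x), SD = sd(x), SeM = sd(x) / sqrt(length(x)),
    Median = median(x), Min = min(x), Max = max(x))
}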

To apply this to factor levels of Variable1 I can do this:

library(plyr)  # provides ddply
ddply(mydf, c("Factor"), function(x) my.summary(x$Variable1))
  Factor n Mean SD       SeM Median Min Max
1      A 3    4  1 0.5773503      4   3   5
2      B 3    4  1 0.5773503      4   3   5

Now I can do the same for Variable2:

ddply(mydf, c("Factor"), function(x) my.summary(x$Variable2))

That is easy enough with just two variables, but with many variables it would be a pain. How can I produce a data frame of the summary stats for each variable and factor level without rewriting the call for every variable?

I have tried aggregate.data.frame, but it doesn't work with my.summary. It works with summary, but produces one big data frame.

Thanks

Answer

jeremycg · Nov 23, 2015

You could use summarise_each from dplyr:

library(dplyr)

mydf %>% group_by(Factor) %>%
         summarise_each(funs(my.summary(.)))

After modifying your function to return a list:

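my.summary <- function(x, na.rm = TRUE) {
  # Wrap the named vector in a list so each group yields a single list-column cell
  list(c(n = length(x), Mean = mean(x, na.rm = na.rm), SD = sd(x, na.rm = na.rm),
         Median = median(x, na.rm = na.rm), Min = min(x, na.rm = na.rm), Max = max(x, na.rm = na.rm)))
}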
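
If you would rather get one plain data frame per variable, with no list columns and no extra packages, a base R version could look roughly like the sketch below. The helper name vec.summary is made up for this example, and the standard error is assumed to be sd / sqrt(n), since SEM() is not defined in the question. Every column other than Factor is treated as a variable to summarise.

```r
# Example data from the question
mydf <- data.frame(
  Factor    = c("A", "A", "A", "B", "B", "B"),
  Variable1 = c(3, 4, 5, 4, 5, 3),
  Variable2 = c(7, 9, 14, 16, 10, 10)
)

# Hypothetical helper (name invented for this sketch): returns a named
# numeric vector; the standard error is assumed to be sd / sqrt(n)
vec.summary <- function(x, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]
  c(n = length(x), Mean = mean(x), SD = sd(x), SeM = sd(x) / sqrt(length(x)),
    Median = median(x), Min = min(x), Max = max(x))
}

# Every column except the grouping factor is a variable to summarise
vars <- setdiff(names(mydf), "Factor")

# Build a named list holding one summary data frame per variable
summaries <- lapply(setNames(vars, vars), function(v) {
  # Split the column by factor level, summarise each piece,
  # and stack the named vectors into a matrix (one row per level)
  stats <- do.call(rbind, lapply(split(mydf[[v]], mydf$Factor), vec.summary))
  data.frame(Factor = rownames(stats), stats, row.names = NULL)
})

summaries$Variable2
```

Each element of summaries is an ordinary data frame with one row per factor level, so adding more variables to mydf needs no change to the code.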