How can I collapse a dataframe by some variables, taking mean across others

Alex Holcombe picture Alex Holcombe · Apr 1, 2010 · Viewed 16.7k times · Source

I need to summarize a data frame by some variables, ignoring the others. This is sometimes referred to as collapsing. E.g. if I have a dataframe like this:

Widget Type Energy  
egg 1 20  
egg 2 30  
jap 3 50  
jap 1 60

Then collapsing by Widget, with Energy the dependent variable, Energy~Widget, would yield

Widget Energy  
egg  25  
jap  55  

In Excel the closest functionality might be "Pivot tables" and I've worked out how to do it in python ( http://alexholcombe.wordpress.com/2009/01/26/summarizing-data-by-combinations-of-variables-with-python/), and here's an example with R using doBy library to do something very related ( http://www.mail-archive.com/[email protected]/msg02643.html), but is there an easy way to do the above? And even better is there anything built into the ggplot2 library to create plots that collapse across some variables?

Answer

Jyotirmoy Bhattacharya picture Jyotirmoy Bhattacharya · Apr 1, 2010

Use aggregate to summarize across a factor:

> df<-read.table(textConnection('
+ egg 1 20
+ egg 2 30
+ jap 3 50
+ jap 1 60'))
> aggregate(df$V3,list(df$V1),mean)
  Group.1  x
1     egg 25
2     jap 55

For more flexibility look at the tapply function and the plyr package.

In ggplot2 use stat_summary to summarize

qplot(V1,V3,data=df,stat="summary",fun.y=mean,geom='bar',width=0.4)