I need to summarize a data frame by some variables, ignoring the others. This is sometimes referred to as collapsing. E.g. if I have a dataframe like this:
Widget Type Energy
egg 1 20
egg 2 30
jap 3 50
jap 1 60
Then collapsing by Widget, with Energy the dependent variable, Energy~Widget, would yield
Widget Energy
egg 25
jap 55
In Excel the closest functionality might be "Pivot tables" and I've worked out how to do it in python ( http://alexholcombe.wordpress.com/2009/01/26/summarizing-data-by-combinations-of-variables-with-python/), and here's an example with R using doBy library to do something very related ( http://www.mail-archive.com/[email protected]/msg02643.html), but is there an easy way to do the above? And even better is there anything built into the ggplot2 library to create plots that collapse across some variables?
Use aggregate
to summarize across a factor:
> df<-read.table(textConnection('
+ egg 1 20
+ egg 2 30
+ jap 3 50
+ jap 1 60'))
> aggregate(df$V3,list(df$V1),mean)
Group.1 x
1 egg 25
2 jap 55
For more flexibility look at the tapply
function and the plyr
package.
In ggplot2
use stat_summary
to summarize
qplot(V1,V3,data=df,stat="summary",fun.y=mean,geom='bar',width=0.4)