Summary statistics by two or more factor variables?

nzcoops · Apr 19, 2012 · Viewed 54k times

This is best illustrated with an example:

str(mtcars)
mtcars$gear <- factor(mtcars$gear, labels=c("three","four","five"))
mtcars$cyl <- factor(mtcars$cyl, labels=c("four","six","eight"))
mtcars$am <- factor(mtcars$am, labels=c("auto","manual"))  # in mtcars, 0 = automatic, 1 = manual
str(mtcars)
tapply(mtcars$mpg, mtcars$gear, sum)

That gives me the summed mpg per gear. But say I wanted a 3x3 table with gear across the top and cyl down the side, and 9 cells holding the bivariate sums. How would I get that 'smartly'?

I could do this:

tapply(mtcars$mpg[mtcars$cyl=="four"], mtcars$gear[mtcars$cyl=="four"], sum)
tapply(mtcars$mpg[mtcars$cyl=="six"], mtcars$gear[mtcars$cyl=="six"], sum)
tapply(mtcars$mpg[mtcars$cyl=="eight"], mtcars$gear[mtcars$cyl=="eight"], sum)

This seems cumbersome.

And how would I then bring a third variable into the mix?

This is somewhat in the space I'm thinking about: "Summary statistics using ddply".

Update: This gets me there, but it's not pretty.

aggregate(mpg ~ am + cyl + gear, mtcars, sum)

Cheers

Answer

Josh O'Brien · Apr 19, 2012

How about this, still using tapply()? It's more versatile than you knew!

with(mtcars, tapply(mpg, list(cyl, gear), sum))
#       three  four five
# four   21.5 215.4 56.4
# six    39.5  79.0 19.7
# eight 180.6    NA 30.8

Or, if you'd like the printed output to be a bit more interpretable:

with(mtcars, tapply(mpg, list("Cylinder#"=cyl, "Gear#"=gear), sum))
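If the NA for empty combinations is unwanted, one alternative that is not part of the answer above is xtabs(), which sums the left-hand side of its formula within each cell and reports 0 where there are no observations. The following is a minimal sketch; the copy named `cars` and the result names `by_tapply` and `by_xtabs` are made up for illustration.

```r
# Work on a copy (hypothetical name) so the built-in mtcars stays untouched.
cars <- mtcars
cars$cyl  <- factor(cars$cyl,  labels = c("four", "six", "eight"))
cars$gear <- factor(cars$gear, labels = c("three", "four", "five"))

# tapply(): factor combinations with no rows come back as NA.
by_tapply <- with(cars, tapply(mpg, list(cyl = cyl, gear = gear), sum))

# xtabs(): the left-hand side (mpg) is summed within each cell,
# and combinations with no rows come back as 0 instead of NA.
by_xtabs <- xtabs(mpg ~ cyl + gear, data = cars)

by_tapply["eight", "four"]  # no 8-cylinder, 4-gear cars in the data
by_xtabs["eight", "four"]
```

Whether 0 or NA is the better marker depends on the statistic: for a sum, 0 is arguably reasonable, but for a mean an empty cell really has no value, and tapply()'s NA is the safer signal.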

If you want to use more than two cross-classifying variables, the idea's exactly the same. The results will then be returned in a 3-or-more-dimensional array:

A <- with(mtcars, tapply(mpg, list(cyl, gear, carb), sum))

dim(A)
# [1] 3 3 6
lapply(1:6, function(i) A[,,i]) # To convert results to a list of matrices
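Once the result is an array, individual cells and slices can be pulled out by name rather than by position. The sketch below assumes a working copy called `cars` and names the list elements passed to tapply() so that the dimensions are labelled; both choices are illustrative, not something the answer prescribes.

```r
# Work on a copy (hypothetical name) so the built-in mtcars stays untouched.
cars <- mtcars
cars$cyl  <- factor(cars$cyl,  labels = c("four", "six", "eight"))
cars$gear <- factor(cars$gear, labels = c("three", "four", "five"))

A <- with(cars, tapply(mpg, list(cyl = cyl, gear = gear, carb = carb), sum))

# Dimension names come from the factor levels. carb is numeric, so it is
# coerced to a factor and its names are the strings "1", "2", "3", "4", "6", "8".
dimnames(A)$carb

# One cell: 4-cylinder, 4-gear, 1-carburetor cars.
A["four", "four", "1"]

# One slice: the cyl x gear matrix for 4-carburetor cars.
A[, , "4"]

# Summing over carb recovers the two-factor sums. na.rm = TRUE skips empty
# cells, so a cyl/gear combination with no cars at all becomes 0 here, not NA.
cyl_by_gear <- apply(A, c(1, 2), sum, na.rm = TRUE)
```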

# But eventually, the curse of dimensionality will begin to kick in...
table(is.na(A))
# FALSE  TRUE 
#    12    42
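When most cells are empty, a long table with one row per occupied combination is often easier to work with than a sparse array. As a rough sketch (the names `cars`, `long`, `agg`, and the column name `mpg_sum` are invented here), the array can be flattened with as.data.frame(as.table(...)), or the same rows can be built directly with aggregate(), as in the question's update:

```r
# Work on a copy (hypothetical name) so the built-in mtcars stays untouched.
cars <- mtcars
cars$cyl  <- factor(cars$cyl,  labels = c("four", "six", "eight"))
cars$gear <- factor(cars$gear, labels = c("three", "four", "five"))

A <- with(cars, tapply(mpg, list(cyl = cyl, gear = gear, carb = carb), sum))

# One row per cell of the array, then drop the empty (NA) cells.
long <- as.data.frame(as.table(A), responseName = "mpg_sum")
long <- long[!is.na(long$mpg_sum), ]
nrow(long)

# aggregate() only ever creates rows for combinations that occur in the data,
# so it never materialises the empty cells in the first place.
agg <- aggregate(mpg ~ cyl + gear + carb, data = cars, FUN = sum)
nrow(agg)
```

Both tables should have as many rows as there are non-NA cells in A.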