sum multiple columns by group with tapply

Mark Miller picture Mark Miller · Jul 28, 2013 · Viewed 29.2k times · Source

I wanted to sum individual columns by group and my first thought was to use tapply. However, I cannot get tapply to work. Can tapply be used to sum multiple columns? If not, why not?

I have searched the internet extensively and found numerous similar questions posted as far back as 2008. However, none of those questions have been answered directly. Instead, the responses invariably suggest using a different function.

Below is an example data set for which I wish to sum apples by state, cherries by state and plums by state. Below that I have compiled numerous alternatives to tapply that do work.

At the bottom I show a simple modification to the tapply source code that allows tapply to perform the desired operation.

Nevertheless, perhaps I am overlooking a simple way to perform the desired operation with tapply. I am not looking for alternative functions, although additional alternatives are welcome.

Given the simplicity of my modification to the tapply source code I wonder why it, or something similar, has not already been implemented.

Thank you for any advice. If my question is a duplicate I will be happy to post my question as an answer to that other question.

Here is the example data set:

df.1 <- read.table(text = '

    state   county   apples   cherries   plums
       AA        1        1          2       3
       AA        2       10         20      30
       AA        3      100        200     300
       BB        7       -1         -2      -3
       BB        8      -10        -20     -30
       BB        9     -100       -200    -300

', header = TRUE, stringsAsFactors = FALSE)

This does not work:

tapply(df.1, df.1$state, function(x) {colSums(x[,3:5])})

The help pages says:

tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)

X       an atomic object, typically a vector.

I was confused by the phrase typically a vector which made me wonder whether a data frame could be used. I have never been clear on what atomic object means.

Here are several alternatives to tapply that do work. The first alternative is a work-around that combines tapply with apply.

apply(df.1[,c(3:5)], 2, function(x) tapply(x, df.1$state, sum))

#    apples cherries plums
# AA    111      222   333
# BB   -111     -222  -333

with(df.1, aggregate(df.1[,3:5], data.frame(state), sum))

#   state apples cherries plums
# 1    AA    111      222   333
# 2    BB   -111     -222  -333

t(sapply(split(df.1[,3:5], df.1$state), colSums))

#    apples cherries plums
# AA    111      222   333
# BB   -111     -222  -333

t(sapply(split(df.1[,3:5], df.1$state), function(x) apply(x, 2, sum)))

#    apples cherries plums
# AA    111      222   333
# BB   -111     -222  -333

aggregate(df.1[,3:5], by=list(df.1$state), sum)

#   Group.1 apples cherries plums
# 1      AA    111      222   333
# 2      BB   -111     -222  -333

by(df.1[,3:5], df.1$state, colSums)

# df.1$state: AA
#   apples cherries    plums 
#      111      222      333 
# ------------------------------------------------------------ 
# df.1$state: BB
#   apples cherries    plums 
#     -111     -222     -333

with(df.1, 
     aggregate(x = list(apples   = apples, 
                        cherries = cherries,
                        plums    = plums), 
               by = list(state   = state), 
               FUN = function(x) sum(x)))

#   state apples cherries plums
# 1    AA    111      222   333
# 2    BB   -111     -222  -333

lapply(split(df.1, df.1$state), function(x) {colSums(x[,3:5])} )

# $AA
#   apples cherries    plums 
#      111      222      333 
#
# $BB
#   apples cherries    plums 
#     -111     -222     -333

Here is the source code for tapply except that I changed the line:

nx <- length(X)

to:

nx <- ifelse(is.vector(X), length(X), dim(X)[1])

This modified version of tapply performs the desired operation:

my.tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
{
    FUN <- if (!is.null(FUN)) match.fun(FUN)
    if (!is.list(INDEX)) INDEX <- list(INDEX)
    nI <- length(INDEX)
    if (!nI) stop("'INDEX' is of length zero")
    namelist <- vector("list", nI)
    names(namelist) <- names(INDEX)
    extent <- integer(nI)
    nx     <- ifelse(is.vector(X), length(X), dim(X)[1])  # replaces nx <- length(X)
    one <- 1L
    group <- rep.int(one, nx) #- to contain the splitting vector
    ngroup <- one
    for (i in seq_along(INDEX)) {
    index <- as.factor(INDEX[[i]])
    if (length(index) != nx)
        stop("arguments must have same length")
    namelist[[i]] <- levels(index)#- all of them, yes !
    extent[i] <- nlevels(index)
    group <- group + ngroup * (as.integer(index) - one)
    ngroup <- ngroup * nlevels(index)
    }
    if (is.null(FUN)) return(group)
    ans <- lapply(X = split(X, group), FUN = FUN, ...)
    index <- as.integer(names(ans))
    if (simplify && all(unlist(lapply(ans, length)) == 1L)) {
    ansmat <- array(dim = extent, dimnames = namelist)
    ans <- unlist(ans, recursive = FALSE)
    } else {
    ansmat <- array(vector("list", prod(extent)),
            dim = extent, dimnames = namelist)
    }
    if(length(index)) {
        names(ans) <- NULL
        ansmat[index] <- ans
    }
    ansmat
}

my.tapply(df.1$apples, df.1$state, function(x) {sum(x)})

#  AA   BB 
# 111 -111

my.tapply(df.1[,3:4] , df.1$state, function(x) {colSums(x)})

# $AA
#   apples cherries 
#      111      222 
#
# $BB
#   apples cherries 
#     -111     -222

Answer

EDi picture EDi · Jul 28, 2013

tapply works on a vector, for a data.frame you can use by (which is a wrapper for tapply, take a look at the code):

> by(df.1[,c(3:5)], df.1$state, FUN=colSums)
df.1$state: AA
  apples cherries    plums 
     111      222      333 
------------------------------------------------------------------------------------- 
df.1$state: BB
  apples cherries    plums 
    -111     -222     -333