aggregate methods treat missing values (NA) differently

Ryan Walker picture Ryan Walker · May 30, 2013 · Viewed 87.5k times · Source

Here's a simple data frame with a missing value:

M = data.frame( Name = c('name', 'name'), Col1 = c(NA, 1) , Col2 = c(1, 1))
#   Name Col1 Col2
# 1 name   NA    1
# 2 name    1    1

When I use aggregate to sum variables by group ('Name') using the formula method:

aggregate(. ~ Name, M, FUN = sum, na.rm = TRUE)

the result is:

# RowName Col1 Col2
#    name    1    1

So the entire first row, which have an NA, is ignored. But if use the "non-formula" specification:

aggregate(M[, 2:3], by = list(M$Name), FUN = sum, na.rm = TRUE)

the result is:

# Group.1 Col1 Col2
#    name    1    2

Here only the (1,1) entry is ignored.

This caused a major debugging headache in one of my codes, since I thought these two calls were equivalent. Is there a good reason why the formula entry method is treated differently?

Thanks.

Answer

A5C1D2H2I1M1N2O1R2T1 picture A5C1D2H2I1M1N2O1R2T1 · May 30, 2013

Good question, but in my opinion, this shouldn't have caused a major debugging headache because it is documented quite clearly in multiple places in the manual page for aggregate.

First, in the usage section:

## S3 method for class 'formula'
aggregate(formula, data, FUN, ...,
          subset, na.action = na.omit)

Later, in the description:

na.action: a function which indicates what should happen when the data contain NA values. The default is to ignore missing values in the given variables.


I can't answer why the formula mode was written differently---that's something the function authors would have to answer---but using the above information, you can probably use the following:

aggregate(.~Name, M, FUN=sum, na.rm=TRUE, na.action=NULL)
#   Name Col1 Col2
# 1 name    1    2