I am working with a huge data table in R containing monthly measurements of temperature for multiple locations, taken by different sources.
The dataset looks like this:
library(data.table)
# Generate random data:
loc <- 1:10
dates <- seq(as.Date("2000-01-01"), as.Date("2004-12-31"), by="month")
mods <- c("A","B", "C", "D", "E")
temp <- runif(length(loc)*length(dates)*length(mods), min=0, max=30)
df <- data.table(expand.grid(Location=loc,Date=dates,Model=mods),Temperature=temp)
So basically, for location 1, I have measurements from january 2000 to december 2004 taken by model A. Then, I have measurements made by model B. And so on for models C, D and E. And then, so on for location 2 to location 10.
What I need to do is, instead of having five different temperature measurements (from the models), to take the mean temperature for all the models.
As a result, I would have, for each location and each date, not five but ONLY ONE temperature measurement (that would be a multi-model mean).
I tried this:
df2 <- df[, Mean:=mean(Temperature), by=list(Model, Location, Date)]
which didn't work as I expected. I would at least expect the resulting data table to be 1/5th the number of rows of the original table, since I am summarizing five measurements into a single one.
What am I doing wrong?
I don't think you generated your test data correctly. The function expand.grid()
takes a cartesian product of all arguments. I'm not sure why you included the Temperature=temp
argument in the expand.grid()
call; that duplicates each temperature value for every single key combination, resulting in a data.table with 9 million rows (this is (10*60*5)^2
). I think you intended one temperature value per key, which should result in 10*60*5
rows:
df <- data.table(expand.grid(Location=loc,Date=dates,Model=mods),Temperature=temp);
df;
## Location Date Model Temperature
## 1: 1 2000-01-01 A 2.469751
## 2: 2 2000-01-01 A 16.103135
## 3: 3 2000-01-01 A 7.147051
## 4: 4 2000-01-01 A 10.301937
## 5: 5 2000-01-01 A 16.760238
## ---
## 2996: 6 2004-12-01 E 26.293968
## 2997: 7 2004-12-01 E 8.446528
## 2998: 8 2004-12-01 E 29.003001
## 2999: 9 2004-12-01 E 12.076765
## 3000: 10 2004-12-01 E 28.410980
If this is correct, you can generate the means across models with this:
df[,.(Mean=mean(Temperature)),.(Location,Date)];
## Location Date Mean
## 1: 1 2000-01-01 9.498497
## 2: 2 2000-01-01 11.744622
## 3: 3 2000-01-01 15.691228
## 4: 4 2000-01-01 11.457154
## 5: 5 2000-01-01 8.897931
## ---
## 596: 6 2004-12-01 17.587000
## 597: 7 2004-12-01 19.555963
## 598: 8 2004-12-01 15.710465
## 599: 9 2004-12-01 15.322790
## 600: 10 2004-12-01 20.240392
Note that the :=
operator does not actually aggregate. It only adds, modifies, or deletes columns in the original data.table. It is possible to add a new column (or overwrite an old column) with duplications of an aggregated calculation (e.g. see http://www.r-bloggers.com/two-of-my-favorite-data-table-features/), but that's not what you want.
In general, when you aggregate a table of data, you are necessarily producing a new table that is reduced to one row per aggregation key. The :=
operator does not do this.
Instead, we need to run a normal index operation on the data.table, grouping by the required aggregation key (which will automatically be included in the output data.table), and add to that the j
argument which will be evaluated once for each group. The result will be a reduced version of the original table, with the results of all j
argument evaluations merged with their respective aggregation keys. Since our j
argument results in a scalar value for each group, our result will be one row per Location
/Date
aggregation key.