Correlation between groups in R data.table

Bram Visser picture Bram Visser · Mar 15, 2014 · Viewed 9k times · Source

Is there a way of elegantly calculating the correlations between values if those values are stored by group in a single column of a data.table (other than converting the data.table to a matrix)?

library(data.table)
set.seed(1)             # reproducibility
dt <- data.table(id=1:4, group=rep(letters[1:2], c(4,4)), value=rnorm(8))
setkey(dt, group)

#    id group      value
# 1:  1     a -0.6264538
# 2:  2     a  0.1836433
# 3:  3     a -0.8356286
# 4:  4     a  1.5952808
# 5:  1     b  0.3295078
# 6:  2     b -0.8204684
# 7:  3     b  0.4874291
# 8:  4     b  0.7383247

Something that works, but requires the group names as input:

cor(dt["a"]$value, dt["b"]$value)
# [1] 0.1556371

I'm looking more for something like:

dt[, cor(value, value), by="group"]

But that does not give me the correlation(s) I'm after.

Here's the same problem for a matrix with the correct results.

set.seed(1)             # reproducibility
m <- matrix(rnorm(8), ncol=2)
dimnames(m) <- list(id=1:4, group=letters[1:2])

#        group
# id           a          b
#   1 -0.6264538  0.3295078
#   2  0.1836433 -0.8204684
#   3 -0.8356286  0.4874291
#   4  1.5952808  0.7383247

cor(m)                  # correlations between groups

#           a         b
# a 1.0000000 0.1556371
# b 0.1556371 1.0000000

Any comments or help greatly appreciated.

Answer

Korone picture Korone · Oct 14, 2014

I've since found an even simple alternative for doing this. You were actually pretty close with your dt[, cor(value, value), by="group"] approach. What you actually need is to first do a Cartesian join on the dates, and then group by. I.e.

dt[dt, allow.cartesian=T][, cor(value, value), by=list(group, group.1)]

This has the advantage that it will join the series together (rather than assume they are the same length). You can then cast this into matrix form, or leave it as it is to plot as a heatmap in ggplot etc.

Full Example

setkey(dt, id)
c <- dt[dt, allow.cartesian=T][, list(Cor = cor(value, value.1)), by = list(group, group.1)]
c

   group group.1       Cor
1:     a       a 1.0000000
2:     b       a 0.1556371
3:     a       b 0.1556371
4:     b       b 1.0000000

dcast(c, group~group.1, value.var = "Cor")

  group         a         b
1     a 1.0000000 0.1556371
2     b 0.1556371 1.0000000