Is there a way of elegantly calculating the correlations between values if those values are stored by group in a single column of a data.table (other than converting the data.table to a matrix)?
library(data.table)
set.seed(1) # reproducibility
dt <- data.table(id=1:4, group=rep(letters[1:2], c(4,4)), value=rnorm(8))
setkey(dt, group)
# id group value
# 1: 1 a -0.6264538
# 2: 2 a 0.1836433
# 3: 3 a -0.8356286
# 4: 4 a 1.5952808
# 5: 1 b 0.3295078
# 6: 2 b -0.8204684
# 7: 3 b 0.4874291
# 8: 4 b 0.7383247
Something that works, but requires the group names as input:
cor(dt["a"]$value, dt["b"]$value)
# [1] 0.1556371
I'm looking more for something like:
dt[, cor(value, value), by="group"]
But that does not give me the correlation(s) I'm after.
Here's the same problem for a matrix with the correct results.
set.seed(1) # reproducibility
m <- matrix(rnorm(8), ncol=2)
dimnames(m) <- list(id=1:4, group=letters[1:2])
# group
# id a b
# 1 -0.6264538 0.3295078
# 2 0.1836433 -0.8204684
# 3 -0.8356286 0.4874291
# 4 1.5952808 0.7383247
cor(m) # correlations between groups
# a b
# a 1.0000000 0.1556371
# b 0.1556371 1.0000000
Any comments or help greatly appreciated.
I've since found an even simple alternative for doing this. You were actually pretty close with your dt[, cor(value, value), by="group"]
approach. What you actually need is to first do a Cartesian join on the dates, and then group by.
I.e.
dt[dt, allow.cartesian=T][, cor(value, value), by=list(group, group.1)]
This has the advantage that it will join the series together (rather than assume they are the same length). You can then cast this into matrix form, or leave it as it is to plot as a heatmap in ggplot etc.
Full Example
setkey(dt, id)
c <- dt[dt, allow.cartesian=T][, list(Cor = cor(value, value.1)), by = list(group, group.1)]
c
group group.1 Cor
1: a a 1.0000000
2: b a 0.1556371
3: a b 0.1556371
4: b b 1.0000000
dcast(c, group~group.1, value.var = "Cor")
group a b
1 a 1.0000000 0.1556371
2 b 0.1556371 1.0000000