How to create a distance matrix for clustering using correlation instead of euclidean distance in R?

umair durrani · May 18, 2015

Goal

I want to do hierarchical clustering of samples (rows) in my data set.

What I know:

I have seen examples where distance matrices are created using Euclidean distance, etc., by employing the dist() function in R. I have also seen correlation used to create a dissimilarity (or similarity) measure between variables (columns).

What I want to do:

I want to create a distance matrix for the ROWS in the data using correlation. So, instead of the Euclidean distance in dist(), I want to use the correlation between each pair of rows. But the methods available in dist() don't include correlation. Is there any way I could do that? This might not be common practice, but I think it's appropriate for my application.

Answer

chappers · May 18, 2015

I think you're a bit confused about what a distance metric is. A distance metric cannot be negative, whereas correlation can definitely be negative. Nevertheless, I will try to answer the gist of your question.
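To make the sign problem concrete, here is a minimal sketch on a small made-up matrix (the values and the name `m` are purely illustrative). It also shows one common convention, which this answer does not itself prescribe: turning a correlation r into the dissimilarity 1 - r, which lies in [0, 2] and is therefore never negative.

```r
# Toy data: 3 rows (samples) and 4 columns (variables); the values are made up.
m <- rbind(a = c(1, 2, 3, 4),
           b = c(2, 4, 6, 9),
           c = c(4, 3, 2, 1))

# cor() correlates columns, so transpose to correlate the rows.
r <- cor(t(m))
range(r)   # contains negative values, so r itself cannot serve as a distance

# Assumed convention: dissimilarity = 1 - correlation, which lies in [0, 2].
d <- as.dist(1 - r)
d
```

Note that 1 - r is a dissimilarity rather than a true metric, but hclust() only requires a dissimilarity structure such as the one as.dist() returns.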

Basically, you want to find out whether two variables are similar, using some combination of distance and correlation. This can easily be visualised with the corrplot package. Using a dataset from the mlbench package as an example, we can visualise it as follows:

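library(mlbench)   # provides the PimaIndiansDiabetes data set
library(corrplot)  # correlation-matrix plotting
data(PimaIndiansDiabetes)
# Drop the "diabetes" column, compute the correlation matrix of the rest,
# and plot it with the variables ordered by hierarchical clustering
plot1 <- corrplot(cor(PimaIndiansDiabetes[,!(names(PimaIndiansDiabetes) %in% c("diabetes"))]), 
                  method="square",
                  order="hclust", tl.cex=0.7, cl.cex=0.5, tl.col="black", addrect=2)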

Here I have highlighted two groups of similar variables; the grouping comes from hclust, with correlation as the measure of similarity.

If you want to use base R functions to see what the dendrograms look like, this can easily be achieved as well:

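# Correlation matrix of all columns except "diabetes"
cor.info <- cor(PimaIndiansDiabetes[,!(names(PimaIndiansDiabetes) %in% c("diabetes"))])
# Cluster the variables on Euclidean distances between rows of the correlation matrix
sim.by.hclust <- hclust(dist(cor.info))
plot(sim.by.hclust)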

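Here we can see how the variables are grouped together using the correlation matrix directly. Note that in this example correlation is not the distance metric: dist() computes Euclidean distances between the rows of the correlation matrix, and hclust() clusters on those distances.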
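The distinction can be checked by hand. The sketch below uses the built-in mtcars data set as a stand-in (an assumption made only so that the snippet runs without extra packages) and compares dist() applied to a correlation matrix with as.dist() applied to 1 - correlation.

```r
# Stand-in data: the built-in mtcars data set (all columns are numeric).
cor.info <- cor(mtcars)

# dist() treats each row of the correlation matrix as a point and
# returns the Euclidean distances between those points.
d.euclid <- dist(cor.info)

# Manual check for the first two variables (mpg and cyl).
manual <- sqrt(sum((cor.info["mpg", ] - cor.info["cyl", ])^2))
all.equal(as.matrix(d.euclid)["mpg", "cyl"], manual)

# By contrast, as.dist() takes the supplied values as the dissimilarities themselves.
d.cor <- as.dist(1 - cor.info)
all.equal(as.matrix(d.cor)["mpg", "cyl"], 1 - cor.info["mpg", "cyl"])
```

Either object can be passed to hclust(); they simply encode different notions of "far apart", so the resulting dendrograms may differ.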

Hope this answers your question...


If you want to cluster the rows instead, simply transpose the data with t(), because cor() computes correlations between columns. Using the same data as above, we have:

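data(PimaIndiansDiabetes)
# Transpose so that the original rows (samples) become columns
tdat <- t(PimaIndiansDiabetes[,!(names(PimaIndiansDiabetes) %in% c("diabetes"))])
# cor() works column-wise, so this is the sample-by-sample correlation matrix
cor.tdat <- cor(tdat)
# As before, dist() takes Euclidean distances between rows of that matrix
sim.by.hclust <- hclust(dist(cor.tdat))
plot(sim.by.hclust)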
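
If the row-wise correlation itself should act as the dissimilarity, which is what the question asks for, one possible variant is sketched below. It is an assumption-laden illustration rather than part of the approach above: it uses the built-in mtcars data as a stand-in, standardises the columns first, takes 1 - correlation as the dissimilarity, and picks average linkage and three groups arbitrarily.

```r
# Stand-in data: built-in mtcars; each row is a sample to be clustered.
x <- as.matrix(mtcars)

# Assumed optional step: standardise the columns so that variables measured
# on large scales do not dominate the row-wise correlations.
x.scaled <- scale(x)

# cor() correlates columns, so transpose to get a row-by-row correlation matrix.
row.cor <- cor(t(x.scaled))

# Use 1 - correlation as the dissimilarity and hand it to hclust() via as.dist().
row.dist <- as.dist(1 - row.cor)
hc <- hclust(row.dist, method = "average")  # the linkage choice is an assumption
plot(hc)

# Cut the tree into an illustrative number of groups (3 is arbitrary).
groups <- cutree(hc, k = 3)
table(groups)
```

If strongly negatively correlated rows should count as similar rather than distant, 1 - abs(row.cor) can be used in place of 1 - row.cor.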