Ways to calculate similarity

MarySheen picture MarySheen · Jun 5, 2010 · Viewed 10k times · Source

I am doing a community website that requires me to calculate the similarity between any two users. Each user is described with the following attributes:

age, skin type (oily, dry), hair type (long, short, medium), lifestyle (active outdoor lover, TV junky) and others.

Can anyone tell me how to go about this problem or point me to some resources?

Answer

George Dontas picture George Dontas · Jun 6, 2010

Another way of computing (in R) all the pairwise dissimilarities (distances) between observations in the data set. The original variables may be of mixed types. The handling of nominal, ordinal, and (a)symmetric binary data is achieved by using the general dissimilarity coefficient of Gower (Gower, J. C. (1971) A general coefficient of similarity and some of its properties, Biometrics 27, 857–874). For more check out this on page 47. If x contains any columns of these data-types, Gower's coefficient will be used as the metric.

For example

x1 <- factor(c(10, 12, 25, 14, 29))
x2 <- factor(c("oily", "dry", "dry", "dry", "oily"))
x3 <- factor(c("medium", "short", "medium", "medium", "long"))
x4 <- factor(c("active outdoor lover", "TV junky", "TV junky", "active outdoor lover", "TV junky"))
x <- cbind(x1,x2,x3,x4)

library(cluster)
daisy(x, metric = "euclidean")

you'll get :

Dissimilarities :
         1        2        3        4
2 2.000000                           
3 3.316625 2.236068                  
4 2.236068 1.732051 1.414214         
5 4.242641 3.741657 1.732051 2.645751

If you are interested on a method for dimensionality reduction for categorical data (also a way to arrange variables into homogeneous clusters) check this