Python equivalent of daisy() in the cluster package of R

Zhubarb · Oct 15, 2014 · Viewed 11.5k times

I have a dataset that contains both categorical (nominal and ordinal) and numerical attributes. I want to calculate the (dis)similarity matrix across my observations using these mixed attributes. Using the daisy() function of the cluster package in R, I can easily get a dissimilarity matrix as follows:

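# Install the cluster package if it is missing, then load it
if (!require("cluster")) { install.packages("cluster"); library("cluster") }
data(flower)  # example dataset shipped with the cluster package
as.matrix(daisy(flower, metric = "gower"))  # pairwise Gower dissimilarities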

This uses the Gower metric to handle the nominal variables. Is there a Python equivalent of the daisy() function in R?

If not, is there any other module or function that supports the Gower metric, or something similar, for calculating the (dis)similarity matrix of a dataset with mixed (nominal, numeric) attributes?

Answer

Marcelo Beckmann · Jan 17, 2017

Simply implementing a Gower function to pass to pdist won't be enough.

Internally, pdist performs several numerical transformations that will fail if you pass it a matrix with mixed data.

I implemented the Gower function according to the original paper, along with the adaptations needed in the pdist module (I could not simply override the functions, because the defs in the pdist module are private).

The results I have obtained with this so far are the same as those from R's daisy function.

The source code is available as a Jupyter notebook: https://sourceforge.net/projects/gower-distance-4python/files/
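
For readers who only want to see what a Gower computation involves, here is a minimal sketch in plain Python. It is not the code from the notebook above and it does not touch pdist. The function name gower_matrix and its categorical parameter are hypothetical names chosen for this illustration. It assumes there are no missing values, equal weights for all columns, and only two kinds of columns: nominal (0 if equal, 1 otherwise) and numeric (absolute difference divided by the column range). Ordinal factors and the other options that daisy() supports are left out.

```python
def gower_matrix(rows, categorical):
    """Pairwise Gower dissimilarities for mixed data (illustrative sketch).

    rows: list of equal-length records (tuples or lists).
    categorical: set of column indices to treat as nominal.
    """
    n_cols = len(rows[0])

    # Range of each numeric column, used to scale differences into [0, 1].
    ranges = {}
    for j in range(n_cols):
        if j not in categorical:
            column = [r[j] for r in rows]
            ranges[j] = max(column) - min(column)

    def dissimilarity(a, b):
        total = 0.0
        for j in range(n_cols):
            if j in categorical:
                # Nominal column: simple mismatch.
                total += 0.0 if a[j] == b[j] else 1.0
            elif ranges[j] > 0:
                # Numeric column: range-scaled absolute difference.
                total += abs(a[j] - b[j]) / ranges[j]
            # A constant numeric column adds 0 (an assumption of this sketch).
        return total / n_cols

    return [[dissimilarity(a, b) for b in rows] for a in rows]


# Toy data: one nominal column followed by two numeric columns.
data = [
    ("red", 1.0, 10),
    ("red", 3.0, 20),
    ("blue", 5.0, 30),
]
d = gower_matrix(data, categorical={0})
for row in d:
    print([round(v, 3) for v in row])
```

The result is a full symmetric matrix with zeros on the diagonal, comparable in shape to as.matrix(daisy(...)) in R. If you need exact agreement with daisy() on ordinal or missing data, check the values against R rather than relying on this sketch.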