I have a dataset that contains both categorical (nominal and ordinal) and numerical attributes. I want to calculate the (dis)similarity matrix across my observations using these mixed attributes. Using the daisy() function of the cluster package in R, I can easily get a dissimilarity matrix as follows:
if(!require("cluster")) { install.packages("cluster"); require("cluster") }
data(flower)
as.matrix(daisy(flower, metric = "gower"))
This uses the Gower metric to handle the nominal variables. Is there a Python equivalent of R's daisy() function?
Or is there any other module or function that supports the Gower metric, or something similar, for calculating the (dis)similarity matrix of a dataset with mixed (nominal, numeric) attributes?
Simply implementing a Gower function to use with pdist won't be enough.
Internally, pdist applies several numerical transformations that will fail if you pass it a matrix with mixed data.
I implemented the Gower function according to the original paper, along with the necessary adaptations to the pdist module (I could not simply override the functions, because the defs in the pdist module are private).
The results I have obtained with it so far match those from R's daisy function.
The source code is available as a Jupyter notebook: https://sourceforge.net/projects/gower-distance-4python/files/
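
For readers who only want to see the idea, here is a minimal, dependency-free sketch of the Gower dissimilarity. It is not the code from the notebook above and it does not go through pdist at all. The names gower_matrix, rows and categorical are made up for this illustration. It assumes nominal columns are compared by simple mismatch (0 or 1), numeric columns by absolute difference divided by the column range, and that a variable is skipped for a pair when either value is missing. Ordinal columns get no special treatment here (you could pass their ranks as numeric), so do not expect it to reproduce every daisy() option.

```python
from math import isnan


def _is_missing(v):
    # Treat None and float NaN as missing values.
    return v is None or (isinstance(v, float) and isnan(v))


def gower_matrix(rows, categorical):
    """Return the n x n Gower dissimilarity matrix as a list of lists.

    rows: list of equal-length records (tuples or lists).
    categorical: list of bools, True where the column is nominal.
    (Both parameter names are hypothetical, chosen for this sketch.)
    """
    n = len(rows)
    p = len(categorical)

    # Range of each numeric column, computed over its non-missing values.
    ranges = []
    for k in range(p):
        if categorical[k]:
            ranges.append(None)
            continue
        vals = [r[k] for r in rows if not _is_missing(r[k])]
        ranges.append(max(vals) - min(vals) if vals else 0.0)

    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            total, weight = 0.0, 0
            for k in range(p):
                a, b = rows[i][k], rows[j][k]
                if _is_missing(a) or _is_missing(b):
                    continue  # this variable is not comparable for the pair
                if categorical[k]:
                    total += 0.0 if a == b else 1.0
                elif ranges[k] > 0:
                    total += abs(a - b) / ranges[k]
                # Assumption: a constant numeric column contributes 0
                # but still counts as a compared variable.
                weight += 1
            # No comparable variables at all: the dissimilarity is undefined.
            d = total / weight if weight else float("nan")
            dist[i][j] = dist[j][i] = d
    return dist


rows = [
    ("red", 1.0, 10.0),
    ("red", 3.0, 20.0),
    ("blue", 5.0, None),  # last value is missing
]
categorical = [True, False, False]
D = gower_matrix(rows, categorical)
print(D[0][1], D[0][2], D[1][2])  # 0.5 1.0 0.75
```

The double loop is O(n² · p), which is fine for small data such as the flower example but slow for large datasets; a vectorised version would be the next step.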