How to specify a distance function for clustering?

Mark Morrisson · Nov 15, 2015

I'd like to cluster points according to a custom distance and, strangely, it seems that neither the scipy nor the sklearn clustering methods allow a distance function to be specified.

For instance, in sklearn.cluster.AgglomerativeClustering, the only thing I can do is pass an affinity matrix (which will be very memory-heavy). To build this matrix, it is recommended to use sklearn.neighbors.kneighbors_graph, but I don't understand how to specify a distance function between two points there either. Could someone enlighten me?

Answer

ali_m · Nov 16, 2015

The scipy hierarchical clustering routines that take raw observations, such as linkage and fclusterdata, accept a custom distance function that takes two 1D vectors specifying a pair of points and returns a scalar. For example, using fclusterdata:

import numpy as np
from scipy.cluster.hierarchy import fclusterdata

# a custom function that just computes Euclidean distance
def mydist(p1, p2):
    diff = p1 - p2
    return np.vdot(diff, diff) ** 0.5

X = np.random.randn(100, 2)

fclust1 = fclusterdata(X, 1.0, metric=mydist)
fclust2 = fclusterdata(X, 1.0, metric='euclidean')

print(np.allclose(fclust1, fclust2))
# True

Valid inputs for the metric= kwarg are the same as for scipy.spatial.distance.pdist: either the name of a built-in metric as a string, or a callable.
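Since the metric is handled the same way as in pdist, another option is to compute the pairwise distances yourself and hand them to linkage. Here is a minimal sketch of that route; the manhattan function, the 'average' linkage method and the choice of 3 clusters are only illustrative assumptions, not something taken from the question:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# hypothetical custom metric for illustration: Manhattan (city-block) distance
def manhattan(p1, p2):
    return np.abs(p1 - p2).sum()

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))

# condensed distance vector with n * (n - 1) / 2 entries, one per pair of points
d_custom = pdist(X, metric=manhattan)
d_builtin = pdist(X, metric='cityblock')

# sanity check: the callable agrees with the equivalent built-in metric
print(np.allclose(d_custom, d_builtin))

# linkage accepts a condensed distance vector in place of raw observations
Z = linkage(d_custom, method='average')

# cut the tree into at most 3 flat clusters (arbitrary choice for this sketch)
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels.shape)
```

Note that the condensed vector still stores one distance per pair of points, so memory grows quadratically with the number of points. A Python callable is also invoked once per pair, which tends to be much slower than passing the name of a built-in metric when one exists.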