I'd like to cluster points according to a custom distance and, strangely, it seems that neither the scipy nor the sklearn clustering methods allow a distance function to be specified.
For instance, in sklearn.cluster.AgglomerativeClustering, the only thing I can do is pass an affinity matrix (which will be very memory-heavy). To build this matrix, it is recommended to use sklearn.neighbors.kneighbors_graph, but I don't understand how to specify a distance function between two points there either. Could someone enlighten me?
All of the scipy hierarchical clustering routines will accept a custom distance function that takes two 1D vectors specifying a pair of points and returns a scalar. For example, using fclusterdata:
import numpy as np
from scipy.cluster.hierarchy import fclusterdata
# a custom function that just computes Euclidean distance
def mydist(p1, p2):
    diff = p1 - p2
    return np.vdot(diff, diff) ** 0.5
X = np.random.randn(100, 2)
fclust1 = fclusterdata(X, 1.0, metric=mydist)
fclust2 = fclusterdata(X, 1.0, metric='euclidean')
print(np.allclose(fclust1, fclust2))
# True
Valid inputs for the metric= kwarg are the same as for scipy.spatial.distance.pdist.
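If you want more control than fclusterdata gives you, the same steps can be written out by hand: compute the pairwise distances with pdist, build the tree with linkage, then cut it with fcluster. The sketch below is only an illustration. The manhattan function, the 'average' linkage method and the threshold of 1.0 are arbitrary choices of mine, not anything required by scipy.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# hypothetical custom metric for illustration: Manhattan (city-block) distance
def manhattan(p1, p2):
    return np.abs(p1 - p2).sum()

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))

# pdist returns a condensed distance vector of length n * (n - 1) / 2,
# i.e. only the upper triangle of the full n x n matrix
D = pdist(X, metric=manhattan)

# 'average' linkage works with any dissimilarity; the 'ward', 'centroid'
# and 'median' methods are only correctly defined for Euclidean distances
Z = linkage(D, method='average')

# cut the tree so that clusters are merged below a cophenetic distance of 1.0
labels = fcluster(Z, t=1.0, criterion='distance')
```

Note that a Python callable is evaluated once per pair of points, so this is much slower than one of the built-in metric strings (here 'cityblock' would give the same distances). The condensed vector still grows quadratically with the number of points, so it only halves the memory cost of a full distance matrix.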