How to compute cluster assignments from linkage/distance matrices in scipy in Python?

user248237 · Apr 11, 2013 · Viewed 11.1k times

If you have this hierarchical clustering call in scipy in Python:

from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform
# dist_matrix is long form distance matrix
linkage_matrix = linkage(squareform(dist_matrix), linkage_method)

then what's an efficient way to go from this to cluster assignments for individual points? That is, a vector of length N, where N is the number of points and entry i is the cluster number of point i, given the clusters produced by applying a given threshold thresh to the resulting clustering.

To clarify: the cluster number of a point is the cluster it falls in after applying a threshold to the tree. Each leaf node then gets exactly one cluster, in the sense that each point belongs to a single "most specific cluster", which is defined by the threshold where you cut the dendrogram.

I know that scipy.cluster.hierarchy.fclusterdata gives you this cluster assignment as its return value, but I am starting from a custom-made distance matrix and distance metric, so I cannot use fclusterdata. The question boils down to: how can I compute what fclusterdata is computing -- the cluster assignments?

Answer

BrenBarn · Apr 15, 2013

If I understand you right, that is what fcluster does:

scipy.cluster.hierarchy.fcluster(Z, t, criterion='inconsistent', depth=2, R=None, monocrit=None)

Forms flat clusters from the hierarchical clustering defined by the linkage matrix Z.

...

Returns: An array of length n. T[i] is the flat cluster number to which original observation i belongs.

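So just call fcluster(linkage_matrix, t), where t is your threshold. Note that with the default criterion='inconsistent', t is compared against the inconsistency coefficient; to cut the dendrogram at a given distance, pass criterion='distance'.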
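As a rough illustration, here is a minimal sketch of the whole path from a precomputed distance matrix to flat cluster labels. The 4-point distance matrix, the "average" linkage method, and the threshold of 5.0 are made-up values for the example, not something from the question; it also assumes the threshold is meant as a cut height on the dendrogram, hence criterion="distance".

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical symmetric distance matrix for 4 points:
# points 0 and 1 are close, points 2 and 3 are close, the two pairs are far apart.
dist_matrix = np.array([
    [0.0, 1.0, 9.0, 9.5],
    [1.0, 0.0, 8.5, 9.0],
    [9.0, 8.5, 0.0, 1.2],
    [9.5, 9.0, 1.2, 0.0],
])

# linkage expects the condensed (1-D) form, so convert the square matrix first.
condensed = squareform(dist_matrix)
linkage_matrix = linkage(condensed, method="average")

# Cut the tree at a distance threshold (assumed value) to get one label per point.
thresh = 5.0
assignments = fcluster(linkage_matrix, t=thresh, criterion="distance")

# assignments[i] is the flat cluster number of original point i.
print(assignments)
```

If you would rather ask for a fixed number of clusters than a cut height, fcluster also accepts criterion="maxclust", in which case t is the maximum number of clusters.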