If you have this hierarchical clustering call in SciPy in Python:
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform
# dist_matrix is long form distance matrix
linkage_matrix = linkage(squareform(dist_matrix), linkage_method)
then what's an efficient way to go from this to cluster assignments for individual points? I.e. a vector of length N, where N is the number of points and each entry i is the cluster number of point i, given the number of clusters generated by a given threshold thresh on the resulting clustering?
To clarify: the cluster number is the cluster a point falls in after applying a threshold to the tree, so each leaf node gets exactly one cluster. It is unique in the sense that each point belongs to one "most specific cluster", which is defined by the threshold where you cut the dendrogram.
I know that scipy.cluster.hierarchy.fclusterdata gives you this cluster assignment as its return value, but I am starting from a custom made distance matrix and distance metric, so I cannot use fclusterdata. The question boils down to: how can I compute what fclusterdata is computing -- the cluster assignments?
If I understand you right, that is what fcluster does:
scipy.cluster.hierarchy.fcluster(Z, t, criterion='inconsistent', depth=2, R=None, monocrit=None)
Forms flat clusters from the hierarchical clustering defined by the linkage matrix Z.
...
Returns: An array of length n. T[i] is the flat cluster number to which original observation i belongs.
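So just call fcluster(linkage_matrix, t), where t is your threshold. Note that with the default criterion='inconsistent', t is compared against the inconsistency coefficient; if your threshold is a height at which to cut the dendrogram, pass criterion='distance'.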
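For illustration, here is a minimal sketch of the whole path from a precomputed distance matrix to flat cluster labels. The 4-point matrix, the "average" linkage method, and the cut height of 5.0 are made-up values, and the sketch assumes dist_matrix is a square, symmetric matrix and that the threshold is meant as a cophenetic distance (hence criterion="distance").

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical square, symmetric distance matrix for 4 points:
# points 0 and 1 are close, points 2 and 3 are close, the two pairs are far apart.
dist_matrix = np.array([
    [0.0, 1.0, 9.0, 9.5],
    [1.0, 0.0, 9.2, 9.1],
    [9.0, 9.2, 0.0, 1.5],
    [9.5, 9.1, 1.5, 0.0],
])

# linkage expects the condensed (1-D) form; squareform converts square -> condensed.
linkage_matrix = linkage(squareform(dist_matrix), method="average")

# Assumed cut height, in the same units as the distances above.
thresh = 5.0

# criterion="distance" cuts the dendrogram at height thresh;
# the default criterion="inconsistent" would interpret t differently.
labels = fcluster(linkage_matrix, t=thresh, criterion="distance")

# labels has one entry per original point; labels[i] is the cluster of point i.
print(labels)
```

The returned labels are integers starting at 1, indexed in the same order as the rows of the original distance matrix.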