I am using the seaborn clustermap to create clusters, and visually it works great (this example produces very similar results).
However, I am having trouble figuring out how to programmatically extract the clusters. For instance, in the example link, how could I find out that 1-1 rh, 1-1 lh, 5-1 rh, 5-1 lh make a good cluster? Visually it's easy. I have tried looking through the data and the dendrograms, but I'm having little success.
EDIT Code from example:
import pandas as pd
import seaborn as sns
sns.set(font="monospace")
df = sns.load_dataset("brain_networks", header=[0, 1, 2], index_col=0)
used_networks = [1, 5, 6, 7, 8, 11, 12, 13, 16, 17]
used_columns = (df.columns.get_level_values("network")
                          .astype(int)
                          .isin(used_networks))
df = df.loc[:, used_columns]
network_pal = sns.cubehelix_palette(len(used_networks),
                                    light=.9, dark=.1, reverse=True,
                                    start=1, rot=-2)
network_lut = dict(zip(map(str, used_networks), network_pal))
networks = df.columns.get_level_values("network")
network_colors = pd.Series(networks).map(network_lut)
cmap = sns.diverging_palette(h_neg=210, h_pos=350, s=90, l=30, as_cmap=True)
result = sns.clustermap(df.corr(), row_colors=network_colors, method="average",
                        col_colors=network_colors, figsize=(13, 13), cmap=cmap)
How can I pull out of result which models are in which clusters?
EDIT2 The result does carry a linkage with it, in the dendrogram_col, which I THINK would work with fcluster. But the threshold value to choose for that is confusing me. I would assume that values in the heatmap that are higher than the threshold would get clustered together?
While using result.dendrogram_col.linkage or result.dendrogram_row.linkage will currently work, it seems to be an implementation detail. The safest route is to first compute the linkages explicitly and pass them to the clustermap function, which has row_linkage and col_linkage parameters just for that.
Replacing the last line in your example (result = ...) with the following code gives the same result as before, but you will also have row_linkage and col_linkage variables that you can use with fcluster etc.
import numpy as np
from scipy.spatial import distance
from scipy.cluster import hierarchy

# Compute the correlation matrix once and reuse it.
correlations = df.corr()
correlations_array = np.asarray(correlations)

# Cluster rows and columns explicitly so the linkages stay available afterwards.
row_linkage = hierarchy.linkage(
    distance.pdist(correlations_array), method='average')
col_linkage = hierarchy.linkage(
    distance.pdist(correlations_array.T), method='average')

result = sns.clustermap(correlations, row_linkage=row_linkage, col_linkage=col_linkage,
                        row_colors=network_colors, method="average",
                        col_colors=network_colors, figsize=(13, 13), cmap=cmap)
In this particular example, the code could be simplified further since the correlation matrix is symmetric, and therefore row_linkage and col_linkage will be identical.
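To illustrate the fcluster step, here is a minimal sketch of turning a linkage into flat clusters. It uses a small synthetic correlation matrix instead of the brain_networks data so that it runs on its own; the column names (a1, b1, ...), the choice of two clusters, and the half-of-maximum threshold are assumptions made up for this example, not values tuned for your data. Note that with criterion="distance" the threshold is compared against the merge distances stored in the third column of the linkage matrix, not against the values shown in the heatmap.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy
from scipy.spatial import distance

# Synthetic stand-in for df.corr(): two groups of three strongly correlated columns.
rng = np.random.default_rng(0)
base_a = rng.normal(size=200)
base_b = rng.normal(size=200)
data = pd.DataFrame({
    "a1": base_a + 0.1 * rng.normal(size=200),
    "a2": base_a + 0.1 * rng.normal(size=200),
    "a3": base_a + 0.1 * rng.normal(size=200),
    "b1": base_b + 0.1 * rng.normal(size=200),
    "b2": base_b + 0.1 * rng.normal(size=200),
    "b3": base_b + 0.1 * rng.normal(size=200),
})
correlations = data.corr()

# Same linkage computation as in the answer above.
row_linkage = hierarchy.linkage(
    distance.pdist(correlations.to_numpy()), method="average")

# Option 1: ask for a fixed number of flat clusters.
labels = hierarchy.fcluster(row_linkage, t=2, criterion="maxclust")

# Option 2: cut the tree at a distance threshold. The threshold lives on the
# scale of the merge distances in row_linkage[:, 2], not the heatmap values.
threshold = 0.5 * row_linkage[:, 2].max()  # arbitrary choice for illustration
labels_by_distance = hierarchy.fcluster(
    row_linkage, t=threshold, criterion="distance")

# Map each cluster label to the row names that belong to it.
clusters = {}
for name, label in zip(correlations.index, labels):
    clusters.setdefault(int(label), []).append(name)

for label, members in sorted(clusters.items()):
    print(label, members)
```

With the data from the question, correlations.index is a MultiIndex, so each member would be a tuple of index levels rather than a plain string. Inspecting row_linkage[:, 2] (or the heights in the plotted dendrogram) is one way to pick a sensible threshold.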
Note: A previous answer included a call to distance.squareform, following what the code in seaborn does, but that is a bug.