How to visualize (dendrogram) a dictionary of hierarchical items?

MarkokraM picture MarkokraM · Feb 18, 2016 · Viewed 7.3k times · Source

This is my first time of doing visualization from hierarchical data in dictionary format with Python. Last part of the data looks like this:

d = {^2820: [^391, ^1024], ^2821: [^759, 'w', ^118, ^51], ^2822: [^291, 'o'], ^2823: [^25, ^64], ^2824: [^177, ^2459], ^2825: [^338, ^1946], ^2826: [^186, ^1511], ^2827: [^162, 'i']}

So I have indices on lists referring back to the keys (index) of the dictionary. I suppose this could be used as a base structure for the visualization, please correct me if I'm wrong. Characters on the data are "end nodes/leaves" which doesn't refer back to any index.

I have found NetworkX which possibly could be used for visualization, but I have no idea where to start with it and my data. I was hoping it would be something as simple as:

import networkx as nx
import matplotlib.pyplot as plt

d = {^2820: [^391, ^1024], ^2821: [^759, 'w', ^118, ^51], ^2822: [^291, 'o'], ^2823: [^25, ^64], ^2824: [^177, ^2459], ^2825: [^338, ^1946], ^2826: [^186, ^1511], ^2827: [^162, 'i']}

G = nx.Graph(d)
nx.draw(G)
plt.show()

I'm looking for hierarchical dendrogram or some sort of clustering as an output. Sorry at this point I'm not totally sure what would be the best visualization, maybe similar to this:

Dendrogram

UPDATE

Using NetworkX actually was very simple. I'm providing other simple sample data and looking for an answer if it can be visualized by dendrogram also instead of wired network graph?

# original sequence: a,b,c,d,b,c,a,b,c,d,b,c
d = {^1: ['b', 'c'], ^2: ['a', ^1, 'd', ^1], 'S': [^2, ^2]}
G = nx.Graph(d)
nx.draw_spring(G, node_size=300, with_labels=True)

NetworkX draw_spring nx.Graph

As we can see, graph show plain relations, but not hierarchy and order of the data what I'm willing to do. DiGraph gives more details, but it is still not possible to construct original sequence from it:

NetworkX draw_spring nx.DiGraph

For dendrogram apparently weight and end nodes needs to be calculated as pointed out on the first answer. For that approach data structure could be something like this:

d = {'a': [], 'b': [], 'c': [], 'd': [], ^1: ['b', 'c'], ^2: ['a', ^1, 'd', ^1], 'S': [^2, ^2]}

Answer

mdml picture mdml · Feb 19, 2016

One idea is to use SciPy's dendrogram function to draw your dendrogram. To do so, you just need to create the linkage matrix Z, which is described in the documentation of the SciPy linkage function. Each row [x, y, w, z] of the linkage matrix Z describes the weight w at which x and y merge to form a rooted subtree with z leaves.

To demonstrate, I've created a simple example using a small dendrogram with three leaves, shown below: example dendrogram

I created this visualization with the following code:

# Load required modules
import networkx as nx
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram

# Construct the graph/hierarchy
d           = { 0: [1, 'd'], 1: ['a', 'b', 'c'], 'a': [], 'b': [], 'c': [], 'd': []}
G           = nx.DiGraph(d)
nodes       = G.nodes()
leaves      = set( n for n in nodes if G.out_degree(n) == 0 )
inner_nodes = [ n for n in nodes if G.out_degree(n) > 0 ]

# Compute the size of each subtree
subtree = dict( (n, [n]) for n in leaves )
for u in inner_nodes:
    children = set()
    node_list = list(d[u])
    while len(node_list) > 0:
        v = node_list.pop(0)
        children.add( v )
        node_list += d[v]

    subtree[u] = sorted(children & leaves)

inner_nodes.sort(key=lambda n: len(subtree[n])) # <-- order inner nodes ascending by subtree size, root is last

# Construct the linkage matrix
leaves = sorted(leaves)
index  = dict( (tuple([n]), i) for i, n in enumerate(leaves) )
Z = []
k = len(leaves)
for i, n in enumerate(inner_nodes):
    children = d[n]
    x = children[0]
    for y in children[1:]:
        z = tuple(subtree[x] + subtree[y])
        i, j = index[tuple(subtree[x])], index[tuple(subtree[y])]
        Z.append([i, j, float(len(subtree[n])), len(z)]) # <-- float is required by the dendrogram function
        index[z] = k
        subtree[z] = list(z)
        x = z
        k += 1

# Visualize
dendrogram(Z, labels=leaves)
plt.show()

There are a few key items to note:

  1. Give the d data structure, I use a NetworkX directed graph (DiGraph) . The directionality is important so we can determine which nodes are leaves (no children -> out degree of zero) and inner_nodes (two or more children -> non-zero out degree).
  2. Usually there is some weight associated with each edge in your dendrogram, but there weren't any weights in your example. Instead, I used the number of leaves in the subtree rooted at each internal node n as the weight for n.
  3. If an inner node has more than two children, you have to add "dummy" internal nodes, since each row of the linkage matrix merges two nodes together. This is why I write for y in children[1:]:, etc.

I'm guessing you may be able to simplify this code based on what your data looks like before creating the d in your example, so this may be more of a proof of concept.