This is my first time of doing visualization from hierarchical data in dictionary format with Python. Last part of the data looks like this:
d = {^2820: [^391, ^1024], ^2821: [^759, 'w', ^118, ^51], ^2822: [^291, 'o'], ^2823: [^25, ^64], ^2824: [^177, ^2459], ^2825: [^338, ^1946], ^2826: [^186, ^1511], ^2827: [^162, 'i']}
So I have indices on lists referring back to the keys (index) of the dictionary. I suppose this could be used as a base structure for the visualization, please correct me if I'm wrong. Characters on the data are "end nodes/leaves" which doesn't refer back to any index.
I have found NetworkX which possibly could be used for visualization, but I have no idea where to start with it and my data. I was hoping it would be something as simple as:
import networkx as nx
import matplotlib.pyplot as plt
d = {^2820: [^391, ^1024], ^2821: [^759, 'w', ^118, ^51], ^2822: [^291, 'o'], ^2823: [^25, ^64], ^2824: [^177, ^2459], ^2825: [^338, ^1946], ^2826: [^186, ^1511], ^2827: [^162, 'i']}
G = nx.Graph(d)
nx.draw(G)
plt.show()
I'm looking for hierarchical dendrogram or some sort of clustering as an output. Sorry at this point I'm not totally sure what would be the best visualization, maybe similar to this:
Using NetworkX actually was very simple. I'm providing other simple sample data and looking for an answer if it can be visualized by dendrogram also instead of wired network graph?
# original sequence: a,b,c,d,b,c,a,b,c,d,b,c
d = {^1: ['b', 'c'], ^2: ['a', ^1, 'd', ^1], 'S': [^2, ^2]}
G = nx.Graph(d)
nx.draw_spring(G, node_size=300, with_labels=True)
As we can see, graph show plain relations, but not hierarchy and order of the data what I'm willing to do. DiGraph gives more details, but it is still not possible to construct original sequence from it:
For dendrogram apparently weight and end nodes needs to be calculated as pointed out on the first answer. For that approach data structure could be something like this:
d = {'a': [], 'b': [], 'c': [], 'd': [], ^1: ['b', 'c'], ^2: ['a', ^1, 'd', ^1], 'S': [^2, ^2]}
One idea is to use SciPy's dendrogram
function to draw your dendrogram. To do so, you just need to create the linkage matrix Z
, which is described in the documentation of the SciPy linkage
function. Each row [x, y, w, z]
of the linkage matrix Z
describes the weight w
at which x
and y
merge to form a rooted subtree with z
leaves.
To demonstrate, I've created a simple example using a small dendrogram with three leaves, shown below:
I created this visualization with the following code:
# Load required modules
import networkx as nx
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram
# Construct the graph/hierarchy
d = { 0: [1, 'd'], 1: ['a', 'b', 'c'], 'a': [], 'b': [], 'c': [], 'd': []}
G = nx.DiGraph(d)
nodes = G.nodes()
leaves = set( n for n in nodes if G.out_degree(n) == 0 )
inner_nodes = [ n for n in nodes if G.out_degree(n) > 0 ]
# Compute the size of each subtree
subtree = dict( (n, [n]) for n in leaves )
for u in inner_nodes:
children = set()
node_list = list(d[u])
while len(node_list) > 0:
v = node_list.pop(0)
children.add( v )
node_list += d[v]
subtree[u] = sorted(children & leaves)
inner_nodes.sort(key=lambda n: len(subtree[n])) # <-- order inner nodes ascending by subtree size, root is last
# Construct the linkage matrix
leaves = sorted(leaves)
index = dict( (tuple([n]), i) for i, n in enumerate(leaves) )
Z = []
k = len(leaves)
for i, n in enumerate(inner_nodes):
children = d[n]
x = children[0]
for y in children[1:]:
z = tuple(subtree[x] + subtree[y])
i, j = index[tuple(subtree[x])], index[tuple(subtree[y])]
Z.append([i, j, float(len(subtree[n])), len(z)]) # <-- float is required by the dendrogram function
index[z] = k
subtree[z] = list(z)
x = z
k += 1
# Visualize
dendrogram(Z, labels=leaves)
plt.show()
There are a few key items to note:
d
data structure, I use a NetworkX directed graph (DiGraph
) . The directionality is important so we can determine which nodes are leaves
(no children -> out degree of zero) and inner_nodes
(two or more children -> non-zero out degree).n
as the weight for n
.for y in children[1:]:
, etc.I'm guessing you may be able to simplify this code based on what your data looks like before creating the d
in your example, so this may be more of a proof of concept.