How can I cluster a graph in Python?

Pietro Speroni picture Pietro Speroni · Mar 17, 2009 · Viewed 18.5k times · Source

Let G be a graph. So G is a set of nodes and set of links. I need to find a fast way to partition the graph. The graph I am now working has only 120*160 nodes, but I might soon be working on an equivalent problem, in another context (not medicine, but website development), with millions of nodes.

So, what I did was to store all the links into a graph matrix:

M=numpy.mat(numpy.zeros((len(data.keys()),len(data.keys()))))

Now M holds a 1 in position s,t, if node s is connected to node t. I make sure M is symmetrical M[s,t]=M[t,s] and each node links to itself M[s,s]=1.

If I remember well if I multiply M with M, the results is a matrix that represents the graph that connects vertexes that are reached on through two steps.

So I keep on multplying M with itself, until the number of zeros in the matrix do not decrease any longer. Now I have the list of the connected components. And now I need to cluster this matrix.

Up to now I am pretty satisfied with the algorithm. I think it is easy, elegant, and reasonably fast. I am having trouble with this part.

Essentially I need to split this graph into its connected components.

I can go through all the nodes, and see what are they connected to.

But what about sorting the matrix reordering the lines. But I don't know if it is possible to do it.

What follows is the code so far:

def findzeros(M):
    nZeros=0
    for t in M.flat:
        if not t:
            nZeros+=1
    return nZeros

M=numpy.mat(numpy.zeros((len(data.keys()),len(data.keys()))))    
for s in data.keys():
    MatrixCells[s,s]=1
    for t in data.keys():
        if t<s:
            if (scipy.corrcoef(data[t],data[s])[0,1])>threashold:
                M[s,t]=1
                M[t,s]=1

nZeros=findzeros(M)
M2=M*M
nZeros2=findzeros(M2)

while (nZeros-nZeros2):
    nZeros=nZeros2
    M=M2
    M2=M*M
    nZeros2=findzeros(M2)

Edit:

It has been suggested that I use SVD decomposition. Here is a simple example of the problem on a 5x5 graph. We shall use this since with the 19200x19200 square matrix is not that easy to see the clusters.

import numpy
import scipy

M=numpy.mat(numpy.zeros((5,5)))

M[1,3]=1
M[3,1]=1
M[1,1]=1
M[2,2]=1
M[3,3]=1
M[4,4]=1
M[0,0]=1

print M

u,s,vh = numpy.linalg.linalg.svd(M)
print u
print s
print vh

Essentially there are 4 clusters here: (0),(1,3),(2),(4) But I still don't see how the svn can help in this context.

Answer

kquinn picture kquinn · Mar 17, 2009

Why not use a real graph library, like Python-Graph? It has a function to determine connected components (though no example is provided). I'd imagine a dedicated library is going to be faster than whatever ad-hoc graph code you've cooked up.

EDIT: NetworkX seems like it might be a better choice than python-graph; its documentation (here for the connected components function) certainly is.