Let G be a graph. So G is a set of nodes and set of links. I need to find a fast way to partition the graph. The graph I am now working has only 120*160 nodes, but I might soon be working on an equivalent problem, in another context (not medicine, but website development), with millions of nodes.
So, what I did was to store all the links into a graph matrix:
M=numpy.mat(numpy.zeros((len(data.keys()),len(data.keys()))))
Now M holds a 1 in position s,t, if node s is connected to node t. I make sure M is symmetrical M[s,t]=M[t,s] and each node links to itself M[s,s]=1.
If I remember well if I multiply M with M, the results is a matrix that represents the graph that connects vertexes that are reached on through two steps.
So I keep on multplying M with itself, until the number of zeros in the matrix do not decrease any longer. Now I have the list of the connected components. And now I need to cluster this matrix.
Up to now I am pretty satisfied with the algorithm. I think it is easy, elegant, and reasonably fast. I am having trouble with this part.
Essentially I need to split this graph into its connected components.
I can go through all the nodes, and see what are they connected to.
But what about sorting the matrix reordering the lines. But I don't know if it is possible to do it.
What follows is the code so far:
def findzeros(M):
nZeros=0
for t in M.flat:
if not t:
nZeros+=1
return nZeros
M=numpy.mat(numpy.zeros((len(data.keys()),len(data.keys()))))
for s in data.keys():
MatrixCells[s,s]=1
for t in data.keys():
if t<s:
if (scipy.corrcoef(data[t],data[s])[0,1])>threashold:
M[s,t]=1
M[t,s]=1
nZeros=findzeros(M)
M2=M*M
nZeros2=findzeros(M2)
while (nZeros-nZeros2):
nZeros=nZeros2
M=M2
M2=M*M
nZeros2=findzeros(M2)
It has been suggested that I use SVD decomposition. Here is a simple example of the problem on a 5x5 graph. We shall use this since with the 19200x19200 square matrix is not that easy to see the clusters.
import numpy
import scipy
M=numpy.mat(numpy.zeros((5,5)))
M[1,3]=1
M[3,1]=1
M[1,1]=1
M[2,2]=1
M[3,3]=1
M[4,4]=1
M[0,0]=1
print M
u,s,vh = numpy.linalg.linalg.svd(M)
print u
print s
print vh
Essentially there are 4 clusters here: (0),(1,3),(2),(4) But I still don't see how the svn can help in this context.
Why not use a real graph library, like Python-Graph? It has a function to determine connected components (though no example is provided). I'd imagine a dedicated library is going to be faster than whatever ad-hoc graph code you've cooked up.
EDIT: NetworkX seems like it might be a better choice than python-graph; its documentation (here for the connected components function) certainly is.