



Let G be a graph, that is, a set of nodes and a set of links. I need to find a fast way to partition the graph. The graph I am working on now has only 120*160 nodes, but I might soon be working on an equivalent problem in another context (not medicine, but website development) with millions of nodes.

So, what I did was to store all the links into a graph matrix:


Now M holds a 1 in position s,t if node s is connected to node t. I make sure M is symmetric, M[s,t]=M[t,s], and that each node links to itself, M[s,s]=1.

If I remember correctly, multiplying M by itself gives a matrix that represents the graph connecting vertices that can be reached within two steps.

So I keep on multiplying M by itself until the number of zeros in the matrix no longer decreases. Now I have the list of the connected components. And now I need to cluster this matrix.
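A minimal sketch of the repeated-squaring idea described above, assuming NumPy and a small hypothetical 5-node graph (the matrix values and the name `reachability` are illustrative, not taken from the original code). Entries are thresholded back to True/False after every product so that they cannot grow without bound:

```python
import numpy as np

def reachability(M):
    """Square a 0/1 adjacency matrix (with self-links) until it stops changing."""
    R = np.asarray(M).astype(bool)
    while True:
        as_int = R.astype(np.int64)
        # R2[s, t] is True if some k has R[s, k] and R[k, t].
        R2 = np.dot(as_int, as_int) > 0
        if np.array_equal(R2, R):
            return R
        R = R2

# Hypothetical example: self-links on the diagonal, plus one link between 1 and 3.
M = np.eye(5, dtype=int)
M[1, 3] = M[3, 1] = 1

R = reachability(M)
n_zeros = int((~R).sum())  # number of node pairs that are not connected
```

Because every node links to itself, each squaring doubles the path length covered, so the loop needs roughly log2 of the longest shortest path iterations. Each product is still a dense matrix multiplication.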

Up to now I am pretty satisfied with the algorithm. I think it is easy, elegant, and reasonably fast. I am having trouble with this part.

Essentially I need to split this graph into its connected components.

I can go through all the nodes and see what they are connected to.

But what about sorting the matrix by reordering its rows? I don't know if that is possible.
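One way to read the clusters off the final matrix, sketched under the assumption that the matrix is symmetric and already closed under multiplication: two nodes are in the same component exactly when their rows are identical, so grouping nodes by row gives the clusters, and listing the nodes cluster by cluster gives a row order that makes the matrix block-diagonal. The matrix below is a hypothetical example and `components_from_reachability` is an illustrative name.

```python
from collections import defaultdict

def components_from_reachability(R):
    """Group node indices whose rows in the closed 0/1 matrix are identical."""
    groups = defaultdict(list)
    for node, row in enumerate(R):
        groups[tuple(bool(x) for x in row)].append(node)
    return sorted(groups.values())

# Hypothetical closed matrix for five nodes where only 1 and 3 are connected.
R = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 1],
]

clusters = components_from_reachability(R)
# Row (and column) order that puts each cluster's nodes next to each other.
order = [node for cluster in clusters for node in cluster]
```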

What follows is the code so far:

def findzeros(M):
    # Count the entries of M that are zero.
    nZeros = 0
    for t in M.flat:
        if not t:
            nZeros += 1
    return nZeros

for s in data.keys():
    for t in data.keys():
        if t<s:
            if (scipy.corrcoef(data[t],data[s])[0,1])>threashold:


while (nZeros-nZeros2):


It has been suggested that I use SVD. Here is a simple example of the problem on a 5x5 graph. We shall use this since the clusters are not that easy to see in the 19200x19200 square matrix.

import numpy
import scipy



print(M)

# numpy.linalg.svd is the public entry point for the SVD routine.
u, s, vh = numpy.linalg.svd(M)
print(u)
print(s)
print(vh)

Essentially there are 4 clusters here: (0), (1,3), (2), (4). But I still don't see how the SVD can help in this context.

+5  A: 

In SciPy you can use sparse matrices. Also note that there are more efficient ways of multiplying a matrix by itself. Anyway, what you're trying to do can be done by SVD.

Introduction with useful links.
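As a sketch of the sparse-matrix suggestion: SciPy also provides `scipy.sparse.csgraph.connected_components`, which labels the components of a sparse adjacency matrix directly, with no repeated multiplication. This routine is not mentioned in the answer above, and the link list and node count below are hypothetical.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

# Hypothetical input: five nodes and a single link between nodes 1 and 3.
n_nodes = 5
links = [(1, 3)]

rows = [s for s, t in links]
cols = [t for s, t in links]
data = np.ones(len(links), dtype=np.int8)

# Only the listed links are stored, so memory grows with the number of links.
M = coo_matrix((data, (rows, cols)), shape=(n_nodes, n_nodes)).tocsr()

# directed=False treats every stored link as going both ways.
n_components, labels = connected_components(M, directed=False)
```

`labels[i]` is the component index of node `i`, so nodes with equal labels belong to the same cluster.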

Thank you. I looked up the resource, but I honestly do not see how it can help. I updated the question with a simple example, and how SVN does not seem to solve it. Or then maybe I am using it wrongly? But how then? Thanks in any case :)
Pietro Speroni
That's SVD (Singular Value Decomposition). Basically, for something as large as millions of nodes you'll need an approximation algorithm rather than an exact one (graph clustering is NP-complete). The article has links to papers explaining such algorithms.
BTW. are you trying to reinvent PageRank or HITS?
Not really. Right now I am just sorting which data belong to which biological cell. In future I have an equivalent problem that will eventually generate a search engine. But not on pages. And not using links. (Can't say more at this stage :) ). In any case, congratulations! Well spotted, LOL.
Pietro Speroni
Latent Semantic Analysis then? ;-) OK, I'm not going to pry. Just keep in mind that what is possible on a small scale gets really complicated when it's big. Most graph algorithms have high polynomial complexity, so it's not feasible to use them on 1 million nodes.
+4  A: 

Why not use a real graph library, like Python-Graph? It has a function to determine connected components (though no example is provided). I'd imagine a dedicated library is going to be faster than whatever ad-hoc graph code you've cooked up.

EDIT: NetworkX seems like it might be a better choice than python-graph; its documentation (here for the connected components function) certainly is.
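For reference, a minimal sketch of what the NetworkX route could look like, assuming the `networkx` package is installed. The link list is hypothetical; `connected_components` yields one set of nodes per component.

```python
import networkx as nx

# Hypothetical input: five nodes and a single link between nodes 1 and 3.
links = [(1, 3)]

G = nx.Graph()
G.add_nodes_from(range(5))  # include isolated nodes explicitly
G.add_edges_from(links)

# Convert the sets to sorted lists for stable, readable output.
clusters = sorted(sorted(component) for component in nx.connected_components(G))
```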

Thank you! Looks like a great resource. I shall investigate it thoroughly.
Pietro Speroni
+1  A: 

Looks like there is a library PyMetis, which will partition your graph for you, given a list of links. It should be fairly easy to extract the list of links from your graph by passing it your original list of linked nodes (not the matrix-multiply-derived one).

Repeatedly performing M' = MM will not be efficient for large orders of M. A full matrix-multiply for matrices of order N will cost N multiplications and N-1 additions per element, of which there are N^2, that is O(N^3) operations. If you are scaling that to "millions of nodes", that would be O(10^18) operations per matrix-matrix multiplication, of which you want to do several.
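The arithmetic above can be checked with a few lines of Python. This only counts scalar operations for a naive dense product; `matmul_ops` is an illustrative name, and real libraries use optimized routines, so treat the numbers as orders of magnitude.

```python
def matmul_ops(n):
    """Scalar operations in a naive dense n x n matrix product."""
    # n multiplications plus (n - 1) additions per element, n * n elements.
    return n * n * (2 * n - 1)

# The current graph (120*160 nodes) and the "millions of nodes" case.
for n in (120 * 160, 10 ** 6):
    print("N = %d: about %.1e operations per product" % (n, matmul_ops(n)))
```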

In short, you don't want to do it this way. The SVD suggestion from Vartec would be the only appropriate choice there. Your best option is just to use PyMetis, and not try to reinvent graph-partitioning.

Phil H
Thanks. I admit the SVD suggestion totally went over my head. I am aware that graph partitioning is a well studied problem, so I was hoping to get some good insights when I posted here. But I also wanted to write what I knew, to show my good will :-)
Pietro Speroni
I think the key is to decide whether you want to learn about partitioning enough to rewrite the software on it (probably not), or whether you just want to partition a graph. If you decide just to use existing solutions, pick a library and use it. Seek to solve it at the highest level.
Phil H
I tried to install PyMetis, but it seems to be hard to install. There seems to be no configuration file. Looking for the easiest way out, I shall install networkx instead. Thanks, Pietro
Pietro Speroni
+1  A: 

Here's a naive implementation I wrote some time ago, which finds the connected components using depth-first search. Although it's very simple, it scales well to tens of thousands of vertices and edges...

import sys
from operator import gt, lt

class Graph(object):
    def __init__(self):
        self.nodes = set()
        self.edges = {}
        self.cluster_lookup = {}
        self.no_link = {}

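    def add_edge(self, n1, n2, w):
        # Register both endpoints so connected_components() can see them.
        self.nodes.add(n1)
        self.nodes.add(n2)
        self.edges.setdefault(n1, {}).update({n2: w})
        self.edges.setdefault(n2, {}).update({n1: w})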

    def connected_components(self, threshold=0.9, op=lt):
        nodes = set(self.nodes)
        components, visited = [], set()
        while len(nodes) > 0:
            connected, visited = self.dfs(nodes.pop(), visited, threshold, op)
            connected = set(connected)
            for node in connected:
                if node in nodes:
                    # Already assigned to this component; do not start from it again.
                    nodes.remove(node)

            subgraph = Graph()
            subgraph.nodes = connected
            subgraph.no_link = self.no_link
            for s in subgraph.nodes:
                for k, v in self.edges.get(s, {}).items():
                    if k in subgraph.nodes:
                        subgraph.edges.setdefault(s, {}).update({k: v})
                if s in self.cluster_lookup:
                    subgraph.cluster_lookup[s] = self.cluster_lookup[s]

            components.append(subgraph)
        return components

    def dfs(self, v, visited, threshold, op=lt, first=None):
        aux = [v]
        # Mark v before recursing so cycles do not revisit it.
        visited.add(v)
        if first is None:
            first = v
        for i in (n for n, w in self.edges.get(v, {}).items()
                  if op(w, threshold) and n not in visited):
            x, y = self.dfs(i, visited, threshold, op, first)
            # Collect the nodes reached through neighbour i.
            aux.extend(x)
            visited = visited.union(y)
        return aux, visited

def main(args):
    graph = Graph()
    # first component
    graph.add_edge(0, 1, 1.0)
    graph.add_edge(1, 2, 1.0)
    graph.add_edge(2, 0, 1.0)

    # second component
    graph.add_edge(3, 4, 1.0)
    graph.add_edge(4, 5, 1.0)
    graph.add_edge(5, 3, 1.0)

    first, second = graph.connected_components(op=gt)
    print(first.nodes)
    print(second.nodes)

if __name__ == '__main__':
    main(sys.argv)

The SVD algorithm is not applicable here, but otherwise Phil H is correct.
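A recursive depth-first search can hit Python's recursion limit when a single component is very large. As an alternative, here is a minimal breadth-first sketch using only the standard library. It is not the code from the answer above, and it assumes nodes are numbered 0..n_nodes-1 and links are given as pairs.

```python
from collections import defaultdict, deque

def connected_components(n_nodes, links):
    """Return the connected components as sorted lists of node indices."""
    neighbours = defaultdict(set)
    for s, t in links:
        neighbours[s].add(t)
        neighbours[t].add(s)

    seen = set()
    components = []
    for start in range(n_nodes):
        if start in seen:
            continue
        # Explore everything reachable from start with an explicit queue.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(sorted(component))
    return components

# Same two triangles as in the example above.
links = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)]
components = connected_components(6, links)
```

Each node and each link is visited a bounded number of times, so the running time grows linearly with the size of the graph.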


As others have pointed out, no need to reinvent the wheel. A lot of thought has been put into optimal clustering techniques. Here is one well-known clustering program.
