I have a data set which is a large unweighted cyclic graph. The cycles occur in loops of about 5-6 paths. It consists of about 8000 nodes, and each node has from 1-6 (usually about 4-5) connections. I'm doing single-pair shortest-path calculations and have implemented the following code to do a breadth-first search.

from Queue import Queue

q = Queue()
parent = {}
fromNode = 'E1123'
toNode = 'A3455'

# path finding
q.put(fromNode)
parent[fromNode] = 'Root'

while not q.empty():
  # get the next node and add its neighbours to queue
  current = q.get()
  for i in getNeighbours(current):
    # note parent and only continue if not already visited
    if i[0] not in parent:
      parent[i[0]] = current
      q.put(i[0])

  # check if destination
  if current == toNode:
    print 'arrived at', toNode
    break

The above code uses the Python 2.6 Queue module, and getNeighbours() is simply a subroutine that makes a single MySQL call and returns the neighbours as a list of tuples, e.g. (('foo',), ('bar',)). The SQL call itself is quick.
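For illustration, a minimal sketch of what such a subroutine could look like -- the MySQLdb driver, the edges table, its fromNode/toNode columns, and the connection details here are all illustrative placeholders, not my actual schema:

import MySQLdb

# hypothetical connection details -- adjust to your setup
conn = MySQLdb.connect(host='localhost', user='user',
                       passwd='secret', db='graphdb')

def getNeighbours(node):
    # one round trip to MySQL per node visited -- this runs inside the BFS loop
    cursor = conn.cursor()
    cursor.execute("SELECT toNode FROM edges WHERE fromNode = %s", (node,))
    rows = cursor.fetchall()  # tuple of 1-tuples, e.g. (('foo',), ('bar',))
    cursor.close()
    return rows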

The code works OK; however, testing down to depths of about 7 layers takes about 20 seconds to run (2.5 GHz Intel, 4 GB RAM, OS X 10.6).

I'd welcome any comments about how to improve the run time of this code.

A: 

I'll bet that machine has more than one core, doesn't it? Run it in parallel.

Python Threading

stimms
I think you mean Python Multiprocessing.
Jed Smith
+1  A: 

Hmm, doesn't BFS involve marking nodes you've already seen so you don't visit them again?

Carl Smotricz
It does. If the node has already been placed in parent[], then it has been seen before, and so it is not put in the queue to act on again.
timbo
Oh, I see. I missed that, sorry. Still, it would be cheaper to put that flag into each node than to manipulate an ever-growing set.
Carl Smotricz
+8  A: 

Well, given the upvotes on the comment, I'll make it an answer now.

The SQL in the tight loop is definitely slowing you down. I don't care how fast the call is. Think about it -- you're asking for a query to be parsed, a lookup to be run -- as fast as that is, it's still in a tight loop. What does your data set look like? Can you just SELECT the entire data set into memory, or at least work with it outside of MySQL?

If you work with that data in memory, you will see a significant performance gain.
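For example, a rough sketch of the one-shot load, assuming the MySQLdb driver and a hypothetical edges table with fromNode/toNode columns (swap in whatever your schema actually uses):

import MySQLdb
from collections import defaultdict

def load_graph(conn):
    # pull every edge in a single query and build an adjacency dict,
    # so the BFS loop never has to touch MySQL again
    cursor = conn.cursor()
    cursor.execute("SELECT fromNode, toNode FROM edges")
    graph = defaultdict(list)
    for fromNode, toNode in cursor.fetchall():
        graph[fromNode].append(toNode)
    cursor.close()
    return graph

With ~8000 nodes at 4-5 connections each, the whole adjacency dict is tiny, and each getNeighbours() call becomes a plain dictionary lookup.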

Jed Smith
Seconded; 8000 nodes will easily fit in memory.
Carl Smotricz
Good call! The table with the node information is simply rows of fromNode, toNode. I will investigate loading it into memory... maybe just a big dictionary structure.
timbo
As a simple test, I have loaded the MySQL table into memory by simply using ENGINE=MEMORY on the CREATE TABLE definition. The same code now completes in about 2.5 seconds!
timbo
@timbo: See there?
Jed Smith
A bit late to the party, but are you using prepared statements for your query? Should be able to shave a little SQL processing time with that, especially in the in-memory variation.
Carl Smotricz
+1  A: 

Something like this:

#!/usr/bin/env python

from Queue import Queue

def traverse_path(fromNode, toNode, nodes):
    def getNeighbours(current, nodes):
        return nodes[current] if current in nodes else []

    def make_path(toNode, graph):
        # walk the parent links back to the root, then reverse
        # so the path reads fromNode -> toNode
        result = []
        while 'Root' != toNode:
            result.append(toNode)
            toNode = graph[toNode]
        result.reverse()
        return result

    q = Queue()
    q.put(fromNode)
    graph = {fromNode: 'Root'}

    while not q.empty():
        # get the next node and add its neighbours to the queue
        current = q.get()
        for neighbor in getNeighbours(current, nodes):
            # record the parent and only continue if not already visited
            if neighbor not in graph:
                graph[neighbor] = current
                q.put(neighbor)

        # check if destination
        if current == toNode:
            return make_path(toNode, graph)
    return []

if __name__ == '__main__':
    nodes = {
        'E1123': ['D111', 'D222', 'D333', 'D444'],
        'D111': ['C01', 'C02', 'C04'],
        'D222': ['C11', 'C03', 'C05'],
        'D333': ['C01'],
        'C02': ['B1'],
        'B1': ['A3455']
    }
    result = traverse_path('E1123', 'A3455', nodes)
    print result

Output:

['E1123', 'D111', 'C02', 'B1', 'A3455']

If you replace your SQL queries with a dictionary of lists (and that would be the tricky part), you will get this performance.

hughdbrown
Awesome, Hugh. I've been getting good performance with an in-memory table. I will try the next step. There are about 14K connections, and this will eventually be a web app, so I need to think carefully about how the load into memory works...
timbo
I've found that my code was similar to yours, Hugh, though yours is certainly more elegant. Apart from the locale-dependent spelling of 'neighbour/neighbor', the only extra suggestion I have is to call result.reverse() in make_path, as this returns the list in the order fromNode -> toNode.
timbo
Hmmmm. That's a good idea. I'll add that.
hughdbrown