tags:

views:

330

answers:

2

I'm rewriting a data-driven legacy application in Python. One of the primary tables is referred to as a "graph table", and does appear to be a directed graph, so I was exploring the NetworkX package to see whether it would make sense to use it for the graph table manipulations, and really implement it as a graph rather than a complicated set of arrays.

However I'm starting to wonder whether the way we use this table is poorly suited for an actual graph manipulation library. Most of the NetworkX functionality seems to be oriented towards characterizing the graph itself in some way, determining shortest distance between two nodes, and things like that. None of that is relevant to my application.

I'm hoping if I can describe the actual usage here, someone can advise me whether I'm just missing something -- I've never really worked with graphs before so this is quite possible -- or if I should be exploring some other data structure. (And if so, what would you suggest?)

We use the table primarily to transform a user-supplied string of keywords into an ordered list of components. This constitutes 95% of the use cases; the other 5% are "given a partial keyword string, supply all possible completions" and "generate all possible legal keyword strings". Oh, and validate the graph against malformation.

Here's an edited excerpt of the table. Columns are:

keyword innode outnode component

acs 1 20 clear
default 1 100 clear
noota 20 30 clear
default 20 30 hst_ota
ota 20 30 hst_ota
acs 30 10000 clear
cos 30 11000 clear
sbc 10000 10199 clear
hrc 10000 10150 clear
wfc1 10000 10100 clear
default 10100 10101 clear
default 10101 10130 acs_wfc_im123
f606w 10130 10140 acs_f606w
f550m 10130 10140 acs_f550m
f555w 10130 10140 acs_f555w
default 10140 10300 clear
wfc1 10300 10310 acs_wfc_ebe_win12f
default 10310 10320 acs_wfc_ccd1

Given the keyword string "acs,wfc1,f555w" and this table, the traversal logic is:

  • Start at node 1; "acs" is in the string, so go to node 20.

  • None of the presented keywords for node 20 are in the string, so choose the default, pick up hst_ota, and go to node 30.

  • "acs" is in the string, so go to node 10000.

  • "wfc1" is in the string, so go to node 10100.

  • Only one choice; go to node 10101.

  • Only one choice, so pick up acs_wfc_im123 and go to node 10130.

  • "f555w" is in the string, so pick up acs_f555w and go to node 10140.

  • Only one choice, so go to node 10300.

  • "wfc1" is in the string, so pick up acs_wfc_ebe_win12f and go to node 10310.

  • Only one choice, so pick up acs_wfc_ccd1 and go to node 10320 -- which doesn't exist, so we're done.

Thus the final list of components is

hst_ota
acs_wfc_im123
acs_f555w
acs_wfc_ebe_win12f
acs_wfc_ccd1

I can make a graph from just the innodes and outnodes of this table, but I couldn't for the life of me figure out how to build in the keyword information that determines which choice to make when faced with multiple possibilities.

Updated to add examples of the other use cases:

  • Given a string "acs", return ("hrc","wfc1") as possible legal next choices

  • Given a string "acs, wfc1, foo", raise an exception due to an unused keyword

  • Return all possible legal strings:

    • cos
    • acs, hrc
    • acs, wfc1, f606w
    • acs, wfc1, f550m
    • acs, wfc1, f555w
  • Validate that all nodes can be reached and that there are no loops.

I can tweak Alex's solution for the first two of these, but I don't see how to do it for the last two.

+2  A: 

Definitely not suitable for general purpose graph libraries (whatever you're supposed to do if more than one of the words meaningful in a node is in the input string -- is that an error? -- or if none does and there is no default for the node, as for node 30 in the example you supply). Just write the table as a dict from node to tuple (default stuff, dict from word to specific stuff) where each stuff is a tuple (destination, word-to-add) (and use None for the special "word-to-add" clear). So e.g.:

tab = {1: (100, None), {'acs': (20, None)}),
       20: ((30, 'hst_ota'), {'ota': (30, 'hst_ota'), 'noota': (30, None)}),
       30: ((None, None), {'acs': (10000,None), 'cos':(11000,None)}),
       etc etc

Now handling this table and an input comma-separated string is easy, thanks to set operations -- e.g.:

def f(icss):
  kws = set(icss.split(','))
  N = 1
  while N in tab:
    stuff, others = tab[N]
    found = kws & set(others)
    if found:
      # maybe error if len(found) > 1 ?
      stuff = others[found.pop()]
    N, word_to_add = stuff
    if word_to_add is not None:
      print word_to_add
Alex Martelli
Thanks Alex! This is very helpful for the traversal use case. I'm not sure about the other cases though - see updates above. In particular I wonder if a graph library would be applicable in the validation case?
Vicki Laidler
Vicki Laidler
A: 

Adding an answer to respond to the further requirements newly edited in...: I still wouldn't go for a general-purpose library. For "all nodes can be reached and there are no loops", simply reasoning in terms of sets (ignoring the triggering keywords) should do: (again untested code, but the general outline should help even if there's some typo &c):

def add_descendants(someset, node):
  "auxiliary function: add all descendants of node to someset"
  stuff, others = tab[node]
  othernode, _ = stuff
  if othernode is not None:
    someset.add(othernode)
  for othernode, _ in others.values():
    if othernode is not None:
      someset.add(othernode)

def islegal():
  "Return bool, message (bool is True for OK tab, False if not OK)"
  # make set of all nodes ever mentioned in the table
  all_nodes = set()
  for node in tab:
    all_nodes.add(node)
    add_desendants(all_nodes, node)

  # check for loops and connectivity
  previously_seen = set()
  currently_seen = set([1])
  while currently_seen:
    node = currently_seen.pop()
    if node in previously_seen:
      return False, "loop involving node %s" % node
    previously_seen.add(node)
    add_descendants(currently_seen, node)

  unreachable = all_nodes - currently_seen
  if unreachable:
    return False, "%d unreachable nodes: %s" % (len(unreachable), unreachable)
  else:
    terminal = currently_seen - set(tab)
    if terminal:
      return True, "%d terminal nodes: %s" % (len(terminal), terminal)
    return True, "Everything hunky-dory"

For the "legal strings" you'll need some other code, but I can't write it for you because I have not yet understood what makes a string legal or otherwise...!

Alex Martelli