views:

241

answers:

3

Wikipedia: Directed Acyclic Graph

Not sure if leaf node is still proper terminology since it's not really a tree (each node can have multiple children and also multiple parents) and also I'm actually trying to find all the root nodes (which is really just a matter of semantics, if you reverse the direction of all the edges it'd they'd be leaf nodes).

Right now we're just traversing the entire graph (that's reachable from the specified node), but that's turning out to be somewhat expensive, so I'm wondering if there's a better algorithm for doing this. One thing I'm thinking is that we keep track of nodes that have been visited already (while traversing a different path) and don't recheck those.

Are there any other algorithmic optimizations?

We also thought about keeping a list of root nodes that this node is a descendant of, but it seems like maintaining such a list would be fairly expensive as well if we need to check if it changes every time a node is added, moved, or removed.

Edit:

This is more than just finding a single node, but rather finding ALL nodes that are endpoints.

Also there is no master list of nodes. Each node has a list of it's children and it's parents. (Well, that's not completely true, but pulling millions of nodes from the DB ahead of time is prohibitively expensive and would likely cause an OutOfMemory exception)

Edit2:

May or may not change possible solutions, but the graph is bottom-heavy in that there's at most a few dozen root nodes (what I'm trying to find) and some millions (possibly tens or hundreds of millions) leaf nodes (where I'm starting from).

+2  A: 

Just color (keep track of) visited nodes.

Sample in Python:

def reachable(nodes, edges, start, end):
  color = {}
  for n in nodes:
    color[n] = False
  q = [start]
  while q:
    n = q.pop()
    if color[n]:
      continue
    color[n] = True
    for adj in edges[n]:
      q.append(adj)
  return color[end]
lispmachine
+1 because that'd work given the info given, but these nodes are our business objects and we'd need to uncolor all the nodes after we're done with them. I already mentioned keeping track of nodes already visited though
Davy8
The structure which contains color data does not need to be part of those business objects. It would be cumbersome (and not thread safe) to keep additional data for every algorithm. A simple mapping of those objects to boolean (like dict in example above) will suffice.
lispmachine
Good point. The other thing is that as I mentioned in my edit (not sure if you saw it) I don't have a list of all nodes and edges, I have the single node and each node has a list of it's parents that I need to traverse recursively. I'm guessing it's just a matter of passing around the "color" list.
Davy8
+2  A: 

There are a few methods that each may be faster depending on your structure, but in general what youre going to want is a traversal.

A depth first search, goes through each possible route, keeping track of nodes that have already been visited. It's a recursive function, because at each node you have to branch and try each child node of it. There's no faster method if you dont know which way to look for the object you just have to try each way! You definitely need to keep track of where you have already been because it would be wasteful otherwise. It should require on the order of the number of nodes to do a full traversal.

A breadth first search is similar but visits each child of the node before "moving on" and as such builds up layers of distance from the chosen root. This can be faster if the destination is expected to be close to the root node. It would be slower if it is expected to be all the way down a path, because it forces you to traverse every possible edge.

Youre right about maybe keeping a list of known root nodes, the tradeoff there is that you basically have to do the search whenever you alter the graph. If you are altering the graph rarely this is acceptable, but if you alter the graph more frequently than you need to generate this information, then of course it is too costly.

EDIT: Info Update. It sounds like we are actually looking for a path between two arbitrary nodes, the root/leaf semantic keeps getting switched. The DepthFirstSearch (DFS) starts at one node, and then for each unvisited child, recurse. Break if you find the target node. Due to the way recursion evaluates, this will traverse all the way down the 'left' path, then enumerate nodes at this distance before ever getting to the 'right' path. This is time costly and inefficient if the target node is potentially the first child on the right. BreadthFirst walks in steps, covering all children before moving forward. Because your graph is bottom heavy like a tree, both will be approximately the same execution time.

When the graph is bottom heavy you might be interested in a reverse traversal. Start at the target node and walk upwards, because there are relatively fewer nodes in this direction. So long as the nodes in general have more parents than children, this direction will be much faster. You can also combine the approaches, stepping one up and one down , then comparing lists of nodes, and meeting somewhere in the middle. (this combination might seem the fastest if you ignore that twice as much work is done at each step).

However, since you said that your graph is stored as a list of lists of children, you have no real way of traversing the graph backwards. A node does not know what its parents are. This is a problem. To fix it you have to get a node to know what its parents are by adding that data on graph update, or by creating a duplicate of the whole structure (which you have said is too large). It will need the whole structure to be rewritten, which sounds probably out of the question due to it being a large database at this point. There's a lot of work to do. http://en.wikipedia.org/wiki/Graph_(data_structure)

Karl
Well, we need to read this information about 3 times as often (very rough guesstimate) as the graph changes, but since as I just edited, the graph's more bottom heavy and I'm looking for nodes at the top.
Davy8
This is getting more and more complicated. Are you starting from a leaf and looking for a root, or are you starting from a root and looking for a leaf? The procedures will be different, and youve asked both... are we looking for a path between two arbitrary nodes? That might be a clearer solution.
Karl
Sorry for the confusion, I'm starting from an arbitrary node and looking for all roots (don't care about the path)
Davy8
They key point being I'm not looking for one specific node, I'm looking for ALL root nodes (i.e. nodes without parents, each node may have more than one parent) that are reachable from the specified node. I have no information on any nodes other than object references to a node's children and a node's parents.
Davy8
(No information on any of the rest of the nodes in the graph)
Davy8
Good that you have a list of a node's parents, so you can walk upwards. Since you need multiple roots you will have to do a lot of traversing, and you can always go in one direction, and if two paths merge (visited node is seen again) you can ignore it and let the other track take care of it. Just needs a couple global lists.
Karl
A: 

For a vertex x you want to compute a bit array f(x), each bit corresponds to a root vertex Ri, and 1 (resp 0) means "x can (resp can't) be reached from root vertex Ri.

You could partition the graph into one "upper" set U containing all your target roots R and such that if x in U then all parents of x are in U. For example the set of all vertices at distance <=D from the closest Ri.

Keep U not too big, and precompute f for each vertex x of U.

Then, for a query vertex y: if y is in U, you already have the result. Otherwise recursively perform the query for all parents of y, caching the value f(x) for each visited vertex x (in a map for example), so you won't compute a value twice. The value of f(y) is the bitwise OR of the value of its parents.

Eric Bainville