views:

72

answers:

2

I have a set of data on which I need to perform a topological sort, with a few assumptions and constraints, and I was wondering if anyone knew an existing, efficient algorithm that would be appropriate for this.

  • The data relationships are known to form a DAG (so no cycles to worry about).
  • An edge from A to B indicates that A depends on B, so B must appear before A in the topological ordering.
  • The graph is not necessarily connected; that is, for any two nodes N and M there may be no way to get from N to M by following edges (even if you ignore edge direction).
  • The data relationships are singly linked. This means that when there is an edge directed from A to B, only the A node contains information about the existence of the edge.

The problem can be formulated as follows:

Given a set of nodes S in graph G which may or may not have incoming edges, find a topological ordering of the subgraph G' consisting of all of the nodes in G that are reachable from any node in set S (obeying edge direction).

This confounds the usual approaches to topological sorting because they require that the nodes in set S do not have any incoming edges, which is something that is not true in my case. The pathological case is:

A --> B --> D
|     ^     ^
|     |     |
\---> C ----/

Where S = {B, C}. An appropriate ordering would be D, B, C, but if a normal topological sort algorithm happened to consider B before C, it would end up with C, D, B, which is completely wrong. (Note that A does not appear in the resulting ordering since it is not reachable from S; it's there to give an example where all of the nodes in S might have incoming edges)

Now, I have to imagine that this is a long-solved problem, since this is essentially what programs like apt and yum have to do when you specify multiple packages in one install command. However, when I search for keyphrases like "dependency resolution algorithm", I get results describing normal topological sorting, which does not handle this particular case.

I can think of a couple of ways to do this, but none of them seem particularly elegant. I was wondering if anyone had some pointers to an appropriate algorithm, preferably one that can operate in a single pass over the data.

+1  A: 

I think you can do a topological sorting of the entire graph and then select only the nodes which are reachable from the set of nodes (you can do some depth first searches from the nodes in the set, in the order resulted after the sorting, and you'll get in the subtree of a node if it wasn't visited before).

Teodor Pripoae
It's not really practical for me to sort the entire graph, since the graph is very large, the portion I want to sort is going to be quite small, and the node information comes out of a database, which would make sorting the whole graph very, very slow.
Tyler McHenry
Ok, so you can make the depth first searches entering in a node if the node wasn't visited before, so you'll get the subgraph and then sort the subgraph. The time complexity is o(k + m), where k is the size of the subgraph and m is the number of links in that subgraph.
Teodor Pripoae
+2  A: 

I don't think you'll find an algorithm that can do this with a single pass over the data. I would perform a breadth-first search, starting with the nodes in S, and then do a topological sort on the resulting subgraph.

Peter Ruderman
Eventually went with this, though I'd still be interested to know if there's anything better and less brute-force out there.
Tyler McHenry