I have a set of data on which I need to perform a topological sort, with a few assumptions and constraints, and I was wondering if anyone knew an existing, efficient algorithm that would be appropriate for this.
- The data relationships are known to form a DAG (so no cycles to worry about).
- An edge from A to B indicates that A depends on B, so B must appear before A in the topological ordering.
- The graph is not necessarily connected; that is, for any two nodes N and M there may be no way to get from N to M by following edges (even if you ignore edge direction).
- The data relationships are singly linked. This means that when there is an edge directed from A to B, only the A node contains information about the existence of the edge.
The problem can be formulated as follows:
Given a set of nodes
S
in graphG
which may or may not have incoming edges, find a topological ordering of the subgraphG'
consisting of all of the nodes inG
that are reachable from any node in setS
(obeying edge direction).
This confounds the usual approaches to topological sorting because they require that the nodes in set S
do not have any incoming edges, which is something that is not true in my case. The pathological case is:
A --> B --> D
| ^ ^
| | |
\---> C ----/
Where S = {B, C}
. An appropriate ordering would be D, B, C
, but if a normal topological sort algorithm happened to consider B
before C
, it would end up with C, D, B
, which is completely wrong. (Note that A
does not appear in the resulting ordering since it is not reachable from S
; it's there to give an example where all of the nodes in S
might have incoming edges)
Now, I have to imagine that this is a long-solved problem, since this is essentially what programs like apt
and yum
have to do when you specify multiple packages in one install command. However, when I search for keyphrases like "dependency resolution algorithm", I get results describing normal topological sorting, which does not handle this particular case.
I can think of a couple of ways to do this, but none of them seem particularly elegant. I was wondering if anyone had some pointers to an appropriate algorithm, preferably one that can operate in a single pass over the data.