views:

238

answers:

2

I have a DAG. I have this operation to add a edge between two nodes.

If A is reachable from B, then B is A's parent. If A is reachable from B without going though another node, then B is A's direct parent.

Requirements for this graph are:

  1. No cycles.
  2. For any node, there is a list of direct parents P[1],P[2],P[3]... P[i] is not a parent of P[j] for any i and j.

If adding a edge, requirement 1 is not met, the edge is not constructed. If adding a edge, requirement 2 is not met, the edge is constructed, but the direct parents will be modified in a way such that requirement 2 is met.

For example, there are 3 nodes

  • A, direct parents: none
  • B, direct parents: A
  • C, direct parents: A

now, if I add a edge between B and C, we have

  • C, direct parents: A,B

but A is a parent of B, doesn't meet requirement 2, thus A is removed from C's direct parent, and we have

  • C, direct parents: B

Currently here is what I did: Add edge from A to B (this A become B's parent)

  1. Check if B is A's parent by BFS. If so, do not add the edge.(this make sure there is no cycles)
  2. Check if A is already B's parent by BFS. If so, do not add the edge.
  3. Find the intersection of A's parent with B's direct parent. This is done by find every parent of A though BFS. Remove the intersection from B's direct parent, and add A as a direct parent of B.(2 and 3 make sure it meet requirement 2)

This is slow. It breaks down at 5k node level(I'm looking for this to handle any graph with less than 100k nodes), the speed become not acceptable, 0.02 second for adding a node edge.

I have a feeling step 1 and 2 can be done in 1 step with some other algorithm.

I thought of using topological ordering, but it have to transverse the entire graph, which is the worst case for my step 1&2. The ordering will be disrupted when the new node is added. so I have to run a topological ordering every time for a insert, so it doesn't create any benefit. For step 3, the entire set of A's parents have to be found. The process is fairly slow, as on average it transverse a decent part of the graph.

How can I make this more efficient?

A: 

Don't know about your implmentation, but recommend you add indexing. Each row index must store pairs metaparent-metachild

metaparent - is parent in link: parent, parent-of-parent,...

metachid - is child in link: child, child-of-child, ...

so for graph A->B->C exists following indexes: A-B, B-C, A-C Adding explicit edge between B->C causes an assertion since such entry already exist. So complexity of algorithm reduces from n^2 to ln(n)

Dewfy
I think the time and space costs of maintaining such a "reachability cache" would actually make performance much worse. Think about the amount of entries in the cache.
Wim Coenen
Not so much if you follow joining 2 strategies: (1) use bit for coding nodes and Lempel Ziv compression, so most popular nodes has shortest index. (2) join node index into binary string. So for you sample about 100k node you will consume twice more memory near 300k.
Dewfy
The possible number of entries in the cache rises as n^2. For 100k nodes the cache can have up to 5 billion entries (100,000^2 possible pairs, divided by 2 because it's a DAG). Still convinced?
Wim Coenen
But you have put restriction (see you own #2) this kind of graph will be approximately near the tree. My solution provides really near the 300k. Review datastructure "trie" - A-B, A-C will place A only once.
Dewfy
+1  A: 

Your question comes down to "Can inserting edges in a DAG be done faster than O(v+e)?" as per requirement (1). Requirement (2) is a more local constraint that doesn't require checking the entire graph.

I think the answer is no: you can't do better than O(v+e) in the worst case (where v is the number of nodes/vertices and e is the number of edges).

No doubt there are tricks to improve expected performance, depending on the properties of your DAG and how it changes over time. This appears to be an active research topic. For example, I imagine for some graphs it might be beneficial to cluster nodes. Inserting edges within a cluster then only requires checks within the cluster sub-DAG. But then you'd need a proper clustering strategy with support for cheaply updating the clustering when adding nodes, etc.

Wim Coenen
I have gone around and did some search, it is a active field of research.but for requirement 2, I think it's possible to just do a transitive reduction once in a while to save time.
Mgccl