views:

99

answers:

2

Given a directed graph, the goal is to combine the node with the nodes it is pointing to and come up with minimum number of these [lets give the name] super nodes.

The catch is once you combine the nodes you can't use those nodes again. [first node as well as all the combined nodes - that is all the members of one super node]

The greedy approach would be to pick the node with maximum out degree and combine that node with nodes it is pointing to and remove all of them. Do this every time with the nodes which are not removed yet from graph.

The greedy is O(V), but this won't necessarily output minimum number super nodes. So what is the best algorithm to do this?

+2  A: 

I'll try to characterize the problem in a different way. You want to separate the vertices into two groups. The first group consists of "root nodes", and the second of "dependent nodes", i.e. nodes directly connected to root nodes.

        root     dependent
                  B
           A  <
                  C

                  E
           D  <
                  F

Fig. 0: A possible result graph.

You want to minimize the number of root nodes. This is the same as maximizing the number of dependent nodes, which is the same as maximizing the number of edges in the resulting graph, which is the same as minimizing the number of edges removed in constructing it from the start graph.

Let's first look at brute force: For each bipartition of the nodes into a root and a dependent set, first check if it fulfills the criteria of the problem statement, finally take the best possible. There are an exponential number of bipartitions, so this will have to be refined.

If we regard each edge as having the possible status "unknown", "take", or "remove", taking an edge removes all edges originating from its end node, as well as all edges ending there (see Fig. 1). However, we can retain at most one of the edges ending in a specific node, anyway.

    A B C
     \|/
      D
     /|\
    E F G

Fig 1: Taking the edge A-D removes all other edges diplayed here.

There are a number of "greedy" heuristics:

  • Put the node with the most outgoing edges into the root set and all its connected nodes into the dependent set (I think that this was your proposal).

This has the problem that some of the connected nodes would be better put into the root set, which brings us to the first refinement:

  • Put the node with the most outgoing connections into the root set and mark all its connected nodes as "reachable". Then loop, but only count connections to nodes that are not "reachable" at each point. Stop when all nodes are in the root set or "reachable". Finally, put all "reachable" nodes that are not in the root set into the dependent set, and arbitrarily distribute them among the root nodes they can be connected to.

This looks quite OK, but still not necessarily optimal. Perhaps you can start from the other side:

  • Put the node with the least number of outgoing connections into the dependent set. "Remove" all its outgoing edges. "Take" its incoming edge that comes from a node with the highest number of outgoing connections, and put that node into the root set. Loop until all nodes are either in the root or dependent set.

I still can't prove whether this might be optimal, but at least I cannot directly think of a situation that breaks it.

Svante
+1  A: 

20! is rather large, larger than 2^61. Fortunately there's a better way to solve small instances: (EDIT) dynamic programming. By saving the optimal solutions to each subproblem, we pay some memory to obtain a very large savings in time.

Here's some sample code in Python. In implementing the code below in another language, you'll probably want to number the vertices 0, ..., n-1 and implement the sets as bit vectors.

# find a smallest node closure of G
# G is a graph in adjacency-list format: G[v] is the set of neighbors of v
def node_closure(G):
    # maps subsets of vertices to their smallest node closure
    smallest = {frozenset(): []}
    def find_smallest(S):
        if S in smallest:
            return smallest[S]
        else:
            candidates = [[v] + find_smallest(S - frozenset([v]) - G[v]) for v in S]
            return min(candidates, key=len)
    return find_smallest(frozenset(G))

This problem has an NP-hardness reduction from set cover that preserves the objective value. This means that unless P = NP, the best guarantee you can obtain for a polynomial-time algorithm is that it always outputs a solution that is at most O(log n) times larger than optimal.

If x1, ..., xm are the elements to be covered and S1, ..., Sn are the sets, then the goal of set cover is to choose the minimum number of sets whose union is {x1, ..., xm}. Assuming that each element appears in at least one set, make a graph with nodes x1, ..., xm, S1, ..., Sn, R where there are arcs from R to all Si and for all i, j, an arc from Si to xj if and only if xj belongs to Si. There's a straightforward correspondence between node closures and set covers: to obtain a node closure from a set cover, remove the vertices corresponding to the sets chosen and then R; to obtain a set cover from a node closure, take all sets whose vertices were chosen plus sets containing each xj whose vertex was chosen.

(Note, for set cover, the greedy algorithm achieves the optimal approximation ratio! Something similar might be true for your problem.)

Thanks algorithmist! Yes - it doesn't even look like NP-complete to me [Given an answer, how are we going to verify it is optimal solution in polynomial time?]. It definitely looks outside NP to me.I am fine with a non polynomial algorithm, if that's the best we can do, my graph is pretty small[#(V) = 20]. For example generating all permutations is exponential time, but we use it all the time for small sequences. In fact brute force solution is to generate permuataions of V and for each permutation form super nodes in the order [of that permutation] and pick minimum.
Fakrudeen
Yes - I forgot the power of exponential - 20! operations in a Ghz processor will take 71 years! Currently I am only working with #(V)=15 [~20 minutes] and I added 5 more just to be safe! Dynamic programming looks like the way to go!
Fakrudeen