I'll try to characterize the problem in a different way. You want to separate the vertices into two groups. The first group consists of "root nodes", and the second of "dependent nodes", i.e. nodes directly connected to root nodes.
root dependent
B
A <
C
E
D <
F
Fig. 0: A possible result graph.
You want to minimize the number of root nodes. This is the same as maximizing the number of dependent nodes, which is the same as maximizing the number of edges in the resulting graph, which is the same as minimizing the number of edges removed in constructing it from the start graph.
Let's first look at brute force: For each bipartition of the nodes into a root and a dependent set, first check if it fulfills the criteria of the problem statement, finally take the best possible. There are an exponential number of bipartitions, so this will have to be refined.
If we regard each edge as having the possible status "unknown", "take", or "remove", taking an edge removes all edges originating from its end node, as well as all edges ending there (see Fig. 1). However, we can retain at most one of the edges ending in a specific node, anyway.
A B C
\|/
D
/|\
E F G
Fig 1: Taking the edge A-D removes all other edges diplayed here.
There are a number of "greedy" heuristics:
- Put the node with the most outgoing edges into the root set and all its connected nodes into the dependent set (I think that this was your proposal).
This has the problem that some of the connected nodes would be better put into the root set, which brings us to the first refinement:
- Put the node with the most outgoing connections into the root set and mark all its connected nodes as "reachable". Then loop, but only count connections to nodes that are not "reachable" at each point. Stop when all nodes are in the root set or "reachable". Finally, put all "reachable" nodes that are not in the root set into the dependent set, and arbitrarily distribute them among the root nodes they can be connected to.
This looks quite OK, but still not necessarily optimal. Perhaps you can start from the other side:
- Put the node with the least number of outgoing connections into the dependent set. "Remove" all its outgoing edges. "Take" its incoming edge that comes from a node with the highest number of outgoing connections, and put that node into the root set. Loop until all nodes are either in the root or dependent set.
I still can't prove whether this might be optimal, but at least I cannot directly think of a situation that breaks it.