I have a problem at hand which can be reduced to something like this:

Assume a bunch of random points in a two-dimensional X-Y plane, where for each Y there could be multiple points with different X's, and for each X there could be multiple points with different Y's.

Whenever a point (Xi,Yi) is chosen, no other point with X = Xi OR Y = Yi can be chosen. We have to choose the maximum number of points.

+3  A: 

Isn't this just the Hungarian algorithm?

Create an n×n matrix, with 0 at marked vertices and 1 at unmarked vertices. The algorithm will choose n vertices, one per row and column, minimizing their sum. Simply count the chosen vertices that equal 0, and you have your answer.

from munkres import Munkres  # Kuhn-Munkres (Hungarian) algorithm

# 0 marks a chosen intersection, 1 everything else.
matrix = [[0, 0, 1],
          [0, 1, 1],
          [1, 0, 0]]

m = Munkres()
total = 0
for row, column in m.compute(matrix):
    if matrix[row][column] == 0:
        print('(%i, %i)' % (row, column))
        total += 1

print('Total: %i' % total)

This runs in O(n³) time, where n is the number of rows in the matrix. The maximum flow solution runs in O(V³), where V is the number of vertices. As long as there are more than n chosen intersections, this runs faster; in fact, it runs orders of magnitude faster as the number of chosen vertices goes up.

Chris B.
Might be, but I don't think it's immediately obvious. How would you build the graph? Are you sure it's a bipartite graph?
@ivlad: i'd like your opinion of my new solution. having put more thought into it, there is a better heuristic.
Agree with IVlad, how do you plan to map the points to the matrix?
The points already _are_ a matrix. Put a 0 at each chosen intersection, and a 1 at all the rest.
Chris B.
@Chris: Say I have (0,0) and (10^9, 10^9). What is your n? 10^9?
@Moron: Yes, if you have 10⁹ rows and 10⁹ columns, then you have a 10⁹×10⁹ matrix. Although with such a degenerate case, it's pretty easy to reduce it to something simpler.
Chris B.
@Chris: Well then you haven't really solved it faster than Nikita did... In fact you have made it exponential! Nikita's solution runs in O(n^3) (can do better, but just stating an upper bound) where n is the _number_ of points. In the degenerate case, n=2. So your algorithm is actually much worse.
I have edited the answer to remove my previous +1 and have now converted it to -1 based on the clarification by Chris.
@Moron: This runs in O(n³) time where _n_ is the number of rows. How is it worse than something which runs in O(n³) time where _n_ is the number of _points_, unless there's less than one point in each row?
Chris B.
@Chris: Look at the so-called 'degenerate' case. n=2. You have 10^9 rows! While the actual number of rows and columns is just 2. You are just adding a huge number of unnecessary points and increasing the complexity for no reason.
@Moron: Degenerate cases generally perform degenerately. People who choose algorithms which have abysmal performance in common cases just so they have exquisite performance in degenerate cases generally don't have their contracts renewed.
Chris B.
@Chris! You are the one who called it a 'degenerate' case! It is not a degenerate case at all! Please, try to understand what I am trying to say. Also, what would you do if the co-ordinates were floating point?
@Moron: How about the inverse case, where all vertices _except_ for 2 are chosen? How do the algorithms stack up then?
Chris B.
@Chris: Your algorithm will still have problems with floating point. How would you deal with that? Even if the numbers were integers, what if the numbers were large? What is your memory usage? Even if you make things work, Maximum-matching will still beat your algorithm.
@Moron: You are wrong, and it's clear you refuse to listen to reason. This debate is fruitless.
Chris B.
@Chris: Sorry, but _you_ are refusing to see the flaws in your algorithm. Anyway, at least I agree with you that this conversation is pointless. The downvote stands. Please don't expect any more responses from me.

For each point, identify the number of other points (N) that would be disqualified by selecting that point (i.e. the ones with the same X or Y value). Then iterate over the non-disqualified points in order of increasing N. When you are finished, you will have chosen the maximum number of points.
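A rough sketch of this greedy heuristic in Python (function and variable names are mine; as the follow-up comments point out, it is an approximation and may not find the true maximum):

```python
def greedy_select(points):
    """Greedy heuristic: repeatedly pick the point that disqualifies
    the fewest remaining others (same X or same Y)."""
    remaining = set(points)
    chosen = []
    while remaining:
        # N = number of *other* remaining points sharing this point's X or Y.
        p = min(remaining,
                key=lambda c: sum(q != c and (q[0] == c[0] or q[1] == c[1])
                                  for q in remaining))
        chosen.append(p)
        # Drop p and everything it disqualifies.
        remaining = {q for q in remaining
                     if q[0] != p[0] and q[1] != p[1]}
    return chosen
```

On (3, 3), (3, 5), (2, 5) it selects two points, which happens to be optimal for that input.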

how can you be sure that an early elimination of a small-number-of-disqualifications won't result in a slightly different number of eventual choices? it'll definitely get an approximation, but can you be sure it's a true maximum without trying them all?
Quite right. The greedy approach cannot guarantee optimality in this case.
That's why you have to do a full tree search, but you could prune many subtrees.
Kennet Belenky
+12  A: 

This can be reduced to a simple maximum flow problem. If you have a point (xi, yi), it is represented in the graph by a path from the source S to node xi, from xi to yi, and from yi to the last node (the sink) T.

Note that if we have points (2, 2) and (2, 5), there's still only one edge from S to x2. All edges have capacity 1.

The value of the maximum flow in this network is the answer.

about general problem

I don't have a graphics editor right now to visualise the problem, but you can easily draw the example by hand. Let's say the points are (3, 3), (3, 5), (2, 5).

Then edges (paths) would be
S -> x2, S -> x3
y3 -> T, y5 -> T
x3 -> y3, x3 -> y5, x2 -> y5

Flow: S -> x2 -> y5 -> T and S -> x3 -> y3 -> T
The amount of 'water' going from source to sink is 2, and so is the answer.

Also there's a tutorial about max flow algorithms
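Here's a compact sketch of this reduction in Python (names are mine). Since every edge has capacity 1, finding the max flow amounts to finding augmenting paths one at a time, which on this network is the same as Kuhn's bipartite matching algorithm:

```python
def max_points(points):
    # One edge x -> y per point; the S -> x and y -> T edges are implicit.
    adj = {}
    for x, y in points:
        adj.setdefault(x, []).append(y)

    match = {}  # y -> x along which one unit of flow currently runs

    def augment(x, seen):
        # DFS for an augmenting path from S through x to T.
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                if y not in match or augment(match[y], seen):
                    match[y] = x
                    return True
        return False

    return sum(augment(x, set()) for x in adj)
```

For the example above, `max_points([(3, 3), (3, 5), (2, 5)])` yields 2, corresponding to the flow S -> x2 -> y5 -> T and S -> x3 -> y3 -> T.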

Nikita Rybak
No need for a graph editor. This looks elegant and simple. I tried greedy heuristics myself, as suggested in one of the replies below, on a sample data set, and the result wasn't the best. Thanks a lot to everyone for your replies.
i'd like your opinion of my algorithm to solve this. posted a different one than before.
It might be faster and/or simpler to just run the maximum matching algorithm, rather than a max-flow algorithm.
This is a really excellent insight. I love isomorphisms like this. :)
Nick Johnson
@Moron I guess, but in practice I've never seen much difference in speed or simplicity between max match and max flow algorithms. In fact, all max match algorithms I know are a bit 'overcomplicated' to implement.
Nikita Rybak
@Nikita: Max matching has another advantage apart from being faster... Say I gave weights to the points and wanted one set with the maximum weight. You probably will _need_ maximum weighted matching now and max flow might not work. As to 'complicated', even max-flow is complicated! Fortunately we probably have libs for both.

This looks like a problem that can be solved with dynamic programming. Look into the algorithms for longest common substring or the knapsack problem.

dynamic programming is certainly not a feasible solution

The XY plane is a red herring. Phrase it as a set of elements, each of which has a set of mutually exclusive elements.

The algorithm then becomes a depth-first search. At each level, for each candidate node, calculate the set of excluded elements: the union of the currently excluded elements with the elements excluded by the candidate node. Try candidate nodes in order from fewest excluded elements to most. Keep track of the best solution so far (the one with the fewest excluded nodes). Prune any subtree that is worse than the current best.

As a slight improvement at the cost of possible missed solutions, you can use Bloom filters for keeping track of the excluded sets.
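A minimal sketch of such a search in Python (names are mine; it uses a simple `chosen + remaining` bound for pruning and plain sets rather than Bloom filters or exclusion-count ordering):

```python
def max_selection(points):
    best = [0]  # best count found so far, used for pruning

    def conflicts(p, q):
        return p[0] == q[0] or p[1] == q[1]

    def dfs(remaining, chosen):
        # Prune: even taking every remaining point can't beat the best.
        if chosen + len(remaining) <= best[0]:
            return
        if not remaining:
            best[0] = chosen
            return
        p = remaining[0]
        # Branch 1: take p and exclude everything sharing its X or Y.
        dfs([q for q in remaining[1:] if not conflicts(p, q)], chosen + 1)
        # Branch 2: skip p.
        dfs(remaining[1:], chosen)

    dfs(list(points), 0)
    return best[0]
```

Unlike the greedy approach, this search is exhaustive, so it returns the true maximum at the cost of worst-case exponential time.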

Kennet Belenky
+1  A: 

Different solution. It turns out that there's a lot of symmetry, and the answer is a lot simpler than I originally thought. The maximum number of points you can ever choose is the minimum of the number of unique X's and unique Y's, which takes O(N log N) to compute if you just want that bound.

Every other shape is equivalent to a rectangle that contains the points, because it doesn't matter how many points you pull from the center of a rectangle, the order will never matter (if handled as below). Any shape that you pluck a point from now has one less unique X and one less unique Y, just like a rectangle.

So the optimal solution has nothing to do with connectedness. Pick any point that is on the edge of the smallest dimension (i.e. if len(unique-Xs) > len(unique-Ys), pick anything that has either the maximum or minimum X). It doesn't matter how many connections it has, just which dimension is biggest, and that can easily be determined from the sorted unique lists created above. If you keep unique-x and unique-y counters and decrement them when you delete all the nodes in that element of the list, then each deletion is O(1) and recalculating the lengths is O(1). So repeating this N times is at worst O(N), and the final complexity is O(N log N) (due solely to the sorting).

You can pick any point along the shortest edge because:

  • if there's only one on that edge, you better pick it now or something else will eliminate it
  • if there's more than one on that edge, who cares, you will eliminate all of them with your pick anyways

Basically, you're maximizing "max(uniqX,uniqY)" at each point.

Update: IVlad caught an edge case:

If the dimensions are equal, take the edge with the fewest points. Even if they aren't equal, take the top or bottom of the unique stack you're eliminating from that has the fewest points.

Case in point:

Turn 1:

  • Points: (1, 2); (3, 5); (10, 5); (10, 2); (10, 3)
  • There are 3 unique X's: 1, 3, 10
  • There are 3 unique Y's: 2, 3, 5
  • The "bounding box" is (1,5),(10,5),(10,2),(1,2)

Reaction 1:

  • The "outer edge" (outermost uniqueX or uniqueY list of points) with the fewest points is the left one. Basically, look at the sets of points at x=1, x=10, y=2 and y=5. The set for x=1 is the smallest: one point. Pick the only point with x=1 -> (1,2).
  • That also eliminates (10,2).

Turn 2:

  • Points: (3, 5); (10, 5); (10, 3)
  • There are 2 unique X's: 3, 10
  • There are 2 unique Y's: 3, 5
  • The "bounding box" is (3,5),(10,5),(10,3),(3,3)

Reaction 2:

  • The "edge" of the bounding box that has the fewest points is either the bottom or the left. We've reached the trivial case - 4 points means all edges are outer edges. Eliminate one. Say (10,3).
  • That also eliminates (10,5).

Turn 3:

  • Points: (3, 5)

Reaction 3:

  • Remove (3,5).
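The min(unique X's, unique Y's) quantity from the first paragraph is cheap to compute (function name is mine). Note that it is only an upper bound on the answer: for points like (0, 0), (0, 1), (0, 2), (10, 10), (11, 10), (12, 10) it gives 4, while only 2 points can actually be chosen.

```python
def selection_upper_bound(points):
    # min(#unique X's, #unique Y's): O(N) with sets,
    # or O(N log N) if you sort instead.
    xs = {x for x, _ in points}
    ys = {y for _, y in points}
    return min(len(xs), len(ys))
```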
Can you run it on an example? I'm not sure I get it. This input: `(1, 2); (3, 5); (10, 5); (10, 2); (10, 3)`, for example.
updated with example.
@eruciform: This is basically the maximum matching (in bipartite graphs) problem rephrased. In fact, maximum matching in bipartite graphs can be reduced to this problem in O(V+E) time. I believe (not exactly sure) that the current best known algorithm for bipartite matching is worse than the O(VlogV) which you seem to have... So if it is correct, you might want to publish it :-)
@moron: thanks! this problem definitely tickled my fancy. i'm thinking about solving it in python and having it run a couple thousand random point clusters with this algo and with brute-force, to see if brute force ever turns up more. it's not proof, but it would be strong evidence. if i do, i'll post it here.
Could you clarify what the 'shape' in the second paragraph is? Thanks! And the greedy approach proposed in the first paragraph has a problem: try points (0, 0) (0, 1) (0, 2) (10, 10), (11, 10) (12, 10). There're 4 unique X coordinates, same for Y. But the answer is 2.
Nikita Rybak
@nikita: by "shape" i mean any collection of dots. my algo for what you just said would pick (12,10),(0,0), which is correct: 2 points. the first paragraph just mentions the "maximum allowable" - the answer may be less, but not more.

Based on a recommendation from IVlad, I looked into the Hopcroft–Karp algorithm. It's generally better than both the maximum flow algorithm and the Hungarian algorithm for this problem, often significantly. Some comparisons:

In general:

  • Max Flow: O(V³), where V is the number of vertices.
  • Hungarian: O(n³), where n is the number of rows in the matrix.
  • Hopcroft-Karp: O(V·√(2V)), where V is the number of vertices.

For a 50×50 matrix, with 50% chosen vertices:

  • Max Flow: 1,250³ = 1,953,125,000
  • Hungarian: 50³ = 125,000
  • Hopcroft-Karp: 1,250 × √2,500 = 62,500

For a 1000×1000 matrix, with 10 chosen vertices:

  • Max Flow: 10³ = 1,000
  • Hungarian: 1000³ = 1,000,000,000
  • Hopcroft-Karp: 10 × √20 ≅ 44.7

The only time the Hungarian algorithm is better is when there is a significantly high proportion of points selected.

For a 100×100 matrix, with 90% chosen vertices:

  • Max Flow: 9,000³ = 729,000,000,000
  • Hungarian: 100³ = 1,000,000
  • Hopcroft-Karp: 9,000 × √18,000 ≅ 1,207,476.7

The Max Flow algorithm is never better.

It's also quite simple, in practice. This code uses an implementation by David Eppstein:

points = {
    0 : [0, 1],
    1 : [0],
    2 : [1, 2],
}

# bipartiteMatch returns a tuple whose first element is the matching dict.
selected = bipartiteMatch(points)[0]

for x, y in selected.items():
    print('(%i, %i)' % (x, y))

print('Total: %i' % len(selected))
Chris B.
I don't think that's right. I'm fairly sure you can implement edmonds-karp on adjacency matrices in `O(n^3)`. Your tests boil down to adjacency lists versus adjacency matrices. It's just that the hungarian algorithm absolutely requires a matrix, but the others don't absolutely require adjacency lists (although I've personally never seen hopcroft-karp implemented with matrices, but I'm pretty sure it can be)
There is no matrix in the original question, BTW: there are just points. If you create an artificially large matrix from the input, its size may not be meaningfully related to the size of the input. Unless, of course, you create a matrix with only those "rows" and "columns" for which points exist, in which case it's an m×n matrix with m ≤ V, n ≤ V.