I have a problem describing an algorithm for finding the maximum rectangular area of binary data in which 1 occurs k times as often as 0. The data is always n^2 bits.

For example, the data for n = 4 looks like:

1 0 1 0
0 0 1 1
0 1 1 1
1 1 0 1

The value of k can be 1 .. j (k = 1 means the numbers of 0's and 1's are equal).

For the above data and k = 1, the solution is the 2x4 sub-rectangle formed by the top two rows:

1 0 1 0 <- these two rows contain 4 x '0' and 4 x '1'
0 0 1 1 <-
0 1 1 1
1 1 0 1

But in this example:
1 1 1 0
0 1 0 0
0 0 0 0
0 1 1 1

The solution is a 3x4 sub-rectangle containing 6 x '0' and 6 x '1', for example the last three columns:

1 [1 1 0]
0 [1 0 0]
0 [0 0 0]
0 [1 1 1]

I tried a few brute-force algorithms, but for n > 20 they get too slow. Can you advise me how I should solve this problem?


As RBerteig proposed, the problem can also be described as follows: "In a given square bitmap with cells set to 1 or 0 by some arbitrary process, find the largest rectangular area where the 1's and 0's occur in a specified ratio, k."

+2  A: 

This is still brute force, but something you should note is that you don't have to recompute everything from scratch for a new i*j rectangle. Instead, for each possible rectangle size, you can move the rectangle across the n*n grid one step at a time, decrementing the counts for the bits no longer within the rectangle and incrementing the counts for the bits that newly entered the rectangle. You could potentially combine this with varying the rectangle size, and try to find an optimal pattern for moving and resizing the rectangle.
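As a hedged sketch of this idea, the incremental update for one fixed window size could look like the following (the function name count_for_size and the dict of results are my own illustration, not from the answer):

```python
def count_for_size(grid, h, w):
    """Slide an h x w window over a square grid of 0/1 values,
    updating the count of 1's incrementally instead of recounting
    the whole window at every position."""
    n = len(grid)
    results = {}  # (top_row, left_col) -> number of 1's in that window
    cnt = 0
    for i in range(n - h + 1):
        if i == 0:
            # first window: count it from scratch once
            cnt = sum(grid[r][c] for r in range(h) for c in range(w))
        else:
            # moved down one row: add the entering row, drop the leaving row
            cnt += sum(grid[i + h - 1][c] for c in range(w))
            cnt -= sum(grid[i - 1][c] for c in range(w))
        row_cnt = cnt
        results[(i, 0)] = row_cnt
        for j in range(1, n - w + 1):
            # moved right one column: add the entering column, drop the leaving one
            row_cnt += sum(grid[r][j + w - 1] for r in range(i, i + h))
            row_cnt -= sum(grid[r][j - 1] for r in range(i, i + h))
            results[(i, j)] = row_cnt
    return results
```

Each move then costs O(h) or O(w) instead of O(h * w) per position.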

R..
I was thinking about that solution, but when I estimated the number of steps it looked too expensive. For n = 200 (40,000 bits) there are 30,000 differently sized "windows" to check: len([(x, y) for x in range(1, 201) for y in range(1, 201) if not (x * y) % 2]) -- using Python to calculate that. BUT each window also has to be moved across the n*n grid: a 2x1 rectangle has (200 - 2 + 1) * (200 - 1 + 1) = 39,800 positions, a 3x1 has 39,600, etc. For n = 200 that adds up to roughly 300,000,000 moves.
You could also cache the totals for each sub-rectangle visited, and use them for building aggregated totals of larger sub rectangles.
RBerteig
@RBerteig: For the data 0 0 0 0 / 0 0 0 0 / 1 1 1 1 / 1 1 1 1, how could I use the mentioned cache?
Once you've summed the 2x2 at the upper left and got 0, remember that fact and the total for its neighboring 2x2 when calculating the sum for the 2x4 at the top. I think the resulting pyramid of resolutions trades memory for not revisiting the same pixel many times.
RBerteig
+3  A: 

Brute force should do just fine here for n < 100 if properly implemented: the solution below has O(n^4) time and O(n^2) memory complexity. 10^8 operations should take well under a second on a modern PC (especially since each operation is very cheap: a few additions and subtractions).

Some observations

  1. There are O(n^4) sub-rectangles to consider, and each of them can be the solution.
  2. If we can find the number of 1's in each sub-rectangle in O(1) (constant time), we'll solve the problem in O(n^4) time.
  3. If we know the number of 1's in a sub-rectangle, we can find the number of 0's through its area.

So, the problem is reduced to the following: create a data structure that allows finding the number of 1's in any sub-rectangle in constant time.

Now, imagine we have the sub-rectangle [i0..i1]x[j0..j1], i.e., it occupies rows i0 through i1 and columns j0 through j1. And let count_ones be the function that counts the number of 1's in a sub-rectangle.

This is the main observation:

count_ones([i0..i1]x[j0..j1]) = count_ones([0..i1]x[0..j1]) - count_ones([0..i0 - 1]x[0..j1]) - count_ones([0..i1]x[0..j0 - 1]) + count_ones([0..i0 - 1]x[0..j0 - 1])

The same observation with a practical example:

AAAABBB
AAAABBB
CCCCDDD
CCCCDDD
CCCCDDD
CCCCDDD

If we need to find number of 1's in D sub-rectangle (3x4), we can do it by taking number of 1's in the whole rectangle (A + B + C + D), subtracting number of 1's in (A + B) rectangle, subtracting number of 1's in (A + C) rectangle, and adding number of 1's in (A) rectangle. (A + B + C + D) - (A + B) - (A + C) + (A) = D

Thus, we need a table sums in which, for each i and j, sums[i][j] contains the number of 1's in the sub-rectangle [0..i]x[0..j].
You can create this table in O(n^2), but even the direct way to fill it (for each i and j, iterate over all elements of the [0..i]x[0..j] area) is only O(n^4).

Having this table,

count_ones([i0..i1]x[j0..j1]) = sums[i1][j1] - sums[i0 - 1][j1] - sums[i1][j0 - 1] + sums[i0 - 1][j0 - 1]

Therefore, the O(n^4) time complexity is reached.
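Putting the whole answer together, a minimal sketch in Python (the name largest_ratio_rectangle is mine; the table is padded with an extra zero row and column so the i0 - 1 / j0 - 1 cases need no special handling):

```python
def largest_ratio_rectangle(grid, k):
    """Return the largest area of a sub-rectangle in which the number
    of 1's is exactly k times the number of 0's."""
    n = len(grid)
    # sums[i][j] = number of 1's in grid[0..i-1] x [0..j-1] (zero-padded)
    sums = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(n):
            sums[i + 1][j + 1] = (grid[i][j] + sums[i][j + 1]
                                  + sums[i + 1][j] - sums[i][j])

    best = 0
    for i0 in range(n):
        for j0 in range(n):
            for i1 in range(i0, n):
                for j1 in range(j0, n):
                    area = (i1 - i0 + 1) * (j1 - j0 + 1)
                    if area <= best:
                        continue  # cannot improve on the best found so far
                    # inclusion-exclusion query, O(1) per rectangle
                    ones = (sums[i1 + 1][j1 + 1] - sums[i0][j1 + 1]
                            - sums[i1 + 1][j0] + sums[i0][j0])
                    if ones == k * (area - ones):
                        best = area
    return best
```

On the first 4x4 example from the question with k = 1, this returns 8 (the top two rows).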

Nikita Rybak
I'm afraid n < 100 is too small a data set. Thanks anyway for your investigation.
@alg0 what complexity do you need, then?
Nikita Rybak
n <= 1000 will be fine.
A: 

Just some hints..

You could impose better restrictions on the values. The requirement means the number of ones N1 satisfies N1 = k * (S - N1), where S = dx*dy is the surface of the area, which leads to the condition

N1 * (k + 1) == S * k.

It can be rewritten in a better form:

N1 / k == S / (k + 1).

Because the greatest common divisor of k and k + 1 is always 1, N1 has to be a multiple of k and dx*dy has to be a multiple of k + 1. This greatly reduces the space of possible solutions, and the larger k is, the better (for the dx*dy case you'll need to play with the prime divisors of k + 1).

Now, because you need just the surface of the largest area with this property, it is wise to start from the largest areas and move toward smaller ones. By trying surfaces dx*dy from n^2 down to k + 1 that satisfy the divisibility and bounding conditions, you'll find the solution quite fast, much faster than O(n^4), for a special reason: unless the array was specially constructed, on a random input the probability that one of the (n - dx + 1) * (n - dy + 1) placements of a surface-S window contains exactly the required number of ones grows steadily as S decreases (large values of k make this probability smaller, but at the same time they make the filter on the dx, dy pairs stronger).
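The divisibility filter alone is easy to sketch (candidate_sizes is a hypothetical helper name, not from the answer):

```python
def candidate_sizes(n, k):
    """List the (dx, dy) window sizes whose area dx * dy is a multiple
    of k + 1, ordered from the largest area down -- the only sizes that
    can possibly hold 1's and 0's in a k : 1 ratio."""
    sizes = [(dx, dy)
             for dx in range(1, n + 1)
             for dy in range(1, n + 1)
             if (dx * dy) % (k + 1) == 0]
    sizes.sort(key=lambda s: s[0] * s[1], reverse=True)
    return sizes
```

For n = 4 and k = 1 this keeps only the even-area windows, starting with the full 4x4.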

Also, this problem looks somewhat similar: http://ioinformatics.org/locations/ioi99/contest/land/land.shtml -- maybe you'll find some ideas in its solution.

ruslik