views:

1143

answers:

3

Given a large sparse matrix (say 10k+ by 1M+) I need to find a subset, not necessarily continuous, of the rows and columns that form a dense matrix (all non-zero elements). I want this sub matrix to be as large as possible (not the largest sum, but the largest number of elements) within some aspect ratio constraints.

Are there any known exact or aproxamate solutions to this problem?

A quick scan on Google seems to give a lot of close-but-not-exactly results. What terms should I be looking for?


edit: Just to clarify; the sub matrix need not be continuous. In fact the row and column order is completely arbitrary so adjacency is completely irrelevant.


A thought based on Chad Okere's idea

  1. Order the rows from largest count to smallest count (not necessary but might help perf)
  2. Select two rows that have a "large" overlap
  3. Add all other rows that won't reduce the overlap
  4. Record that set
  5. Add whatever row reduces the overlap by the least
  6. Repeat at #3 until the result gets to small
  7. Start over at #2 with a different starting pair
  8. Continue until you decide the result is good enough
A: 

EDIT. This is NOT the same as the problem below.. My bad...

But based on the last comment below, it might be equivilent to the following:

  1. Find the furthest vertically separated pair of zero points that have no zero point between them.
  2. Find the furthest horizontally separated pair of zero points that have no zeros between them ?
  3. Then the horizontal region you're looking for is the rectangle that fits between these two pairs of points?

    This exact problem is discussed in a gem of a book called "Programming Pearls" by Jon Bentley, and, as I recall, although there is a solution in one dimension, there is no easy answer for the 2-d or higher dimensional variants ...

The 1=D problem is, effectively, find the largest sum of a contiguous subset of a set of numbers:

iterate through the elements, keeping track of a running total from a specific previous element, and the maximum subtotal seen so far (and the start and end elemnt that generateds it)... At each element, if the maxrunning subtotal is greater than the max total seen so far, the max seen so far and endelemnt are reset... If the max running total goes below zero, the start element is reset to the current element and the running total is reset to zero ...

The 2-D problem came from an attempt to generate a visual image processing algorithm, which was attempting to find, within a stream of brightnesss values representing pixels in a 2-color image, find the "brightest" rectangular area within the image. i.e., find the contained 2-D sub-matrix with the highest sum of brightness values, where "Brightness" was measured by the difference between the pixel's brighness value and the overall average brightness of the entire image (so many elements had negative values)

EDIT: To look up the 1-D solution I dredged up my copy of the 2nd edition of this book, and in it, Jon Bentley says "The 2-D version remains unsolved as this edition goes to print..." which was in 1999.

Charles Bretana
I'm not following. I don't think the problem can exist at all in 1-D.
BCS
Theoretically, it can, but not with your given specs.
Sev
The only 1-D version I can think of amounts to "select all non zeros form a list" but that's not interesting past first year CS.
BCS
Key word in the 1-D version that makes it slightly interesting is "find the largest **continuous** set of non-zero's in a list."
Sev
This answers a totally different question. In the OP's question he wants to select a subset (not necessarily continuous) which is all non-zero (not highest sum)
yairchu
Then it is trivialo, just select all the positve values... I tool it from "... form a dense matrix..." that it had to be a contiguous subset in the original matrix.. " But you're right Idid not include the restriction that the selected region cannot have negative values...
Charles Bretana
positive/negative is irrelevant as I don't care at all what values the sub matrix has, just that no elements are zero. For that matter you can think of it as a Boolean matrix.
BCS
A: 

Is this a Netflix problem?

MATLAB or some other sparse matrix libraries might have ways to handle it.

Is your intent to write your own?

Maybe the 1D approach for each row would help you. The algorithm might look like this:

  1. Loop over each row
  2. Find the index of the first non-zero element
  3. Find the index of the non-zero row element with the largest span between non-zero columns in each row and store both.
  4. Sort the rows from largest to smallest span between non-zero columns.

At this point I start getting fuzzy (sorry, not an algorithm designer). I'd try looping over each row, lining up the indexes of the starting point, looking for the maximum non-zero run of column indexes that I could.

You don't specify whether or not the dense matrix has to be square. I'll assume not.

I don't know how efficient this is or what its Big-O behavior would be. But it's a brute force method to start with.

duffymo
"The Netflix problem? No, why do you ask? ;)" I'm not sure if you are assuming I want a continuous sub matrix or not. For the record, I don't care if the result is a continuous sub matrix.
BCS
Why do I ask? Mere curiosity. As far as continuity goes, I'll assume that you mean you can reorder rows to make them continuous if you wish. Calculate the starting index and max length for each row and then determine the largest dense submatrix from that result.
duffymo
That first bit was supposed to be ironic (in fact it is related to the general form of "the netflix problem"). What I'm not getting is what 'length' you are referring to. I am free to reorder both rows and columns so the closest I have to a length is a non-zero element count.
BCS
+1  A: 

I assume you want something like this. You have a matrix like

1100101
1110101
0100101

You want columns 1,2,5,7 and rows 1 and 2, right? That submatrix would 4x2 with 8 elements. Or you could go with columns 1,5,7 with rows 1,2,3 which would be a 3x3 matrix.

If you want an 'approximate' method, you could start with a single non-zero element, then go on to find another non-zero element and add it to your list of rows and columns. At some point you'll run into a non-zero element that, if it's rows and columns were added to your collection, your collection would no longer be entirely non-zero.

So for the above matrix, if you added 1,1 and 2,2 you would have rows 1,2 and columns 1,2 in your collection. If you tried to add 3,7 it would cause a problem because 1,3 is zero. So you couldn't add it. You could add 2,5 and 2,7 though. Creating the 4x2 submatrix.

You would basically iterate until you can't find any more new rows and columns to add. That would get you too a local minimum. You could store the result and start again with another start point (perhaps one that didn't fit into your current solution).

Then just stop when you can't find any more after a while.

That, obviously, would take a long time, but I don't know if you'll be able to do it any more quickly.

Chad Okere
that would work but I think it will be `O(n!)` or worse. OTOH I did give me an idea...
BCS
BCS: You're essentially looking for an element in the direct product of the power set of rows and the power set of columns. Power set searchers are usually going to be (IIRC) NP-Hard. You have to find some approximate solution. The idea with the algorithm I posted above is that you just keep doing it, hopefully finding better and better solutions with each run. How long it will actually take depends on the matrix. A fully dense matrix would just take O(max(n,m)) (because there are no conflicts) where n and m are the number of rows and columns.
Chad Okere
I suspected that it was NP-hard
BCS