Hello,

I have a cost optimization problem that I don't know if there is literature on. It is a bit hard to explain, so I apologize in advance for the length of the question.

There is a server I am accessing that works this way:

  • a request is made on records (r1, ..., rn) and fields (f1, ..., fp)
  • you can only request the Cartesian product (r1, ..., rn) x (f1, ..., fp)
  • the cost (time and money) associated with such a request is affine in the size of the request:

T((r1, ..., rn) x (f1, ..., fp)) = a + b * n * p

Without loss of generality (just by normalizing), we can assume that b = 1, so the cost is:

T((r1, ..., rn) x (f1, ..., fp)) = a + n * p

  • I only need to request a subset of pairs (r1, f(r1)), ..., (rk, f(rk)); these requests come from the users. My program acts as a middleman between the user and the (external) server. I get a lot of requests like this (tens of thousands a day).

Graphically, we can think of it as an n x p sparse matrix whose nonzero values I want to cover with rectangular submatrices:

   r1 r2 r3 ... rn
   ------      ___
f1 |x  x|      |x|
f2 |x   |      ---
   ------
f3
..    ______
fp    |x  x|
      ------

The constraints are:

  • the number of submatrices must be kept reasonable, because of the constant cost
  • every 'x' must lie within some submatrix
  • the total area covered must not be too large, because of the linear cost

I will call g the sparseness coefficient of my problem (number of needed pairs over total possible pairs: g = k / (n * p)). I know the coefficient a.
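
To fix the notation in code form, here is a minimal sketch of the cost model and sparseness coefficient described above; the value used for a below is purely illustrative:

A = 10.0                      # illustrative value; the real a is known to the application

def cost(request, a=A):
    """Cost of one request, where a request is a (records, fields) pair of sets."""
    records, fields = request
    return a + len(records) * len(fields)

def sparseness(needed_pairs, n, p):
    """g = number of needed (record, field) pairs over the n * p possible pairs."""
    return len(needed_pairs) / (n * p)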

There are some obvious observations:

  • if a is small, the best solution is to request each (record, field) pair independently, and the total cost is: k * (a + 1) = g * n * p * (a + 1)
  • if a is large, the best solution is to request the whole Cartesian product, and the total cost is: a + n * p
  • the second solution becomes better as soon as g > g_min = (1 + a / (n * p)) / (a + 1) (a small worked example appears a bit further down)
  • of course the order within the Cartesian products is unimportant, so I can permute the rows and the columns of my matrix to make it more easily coverable, for example:
   f1 f2 f3
r1  x    x
r2     x 
r3  x    x

can be reordered as

   f1 f3 f2
r1  x  x
r3  x  x
r2       x

And there is an optimal solution which is to request (f1,f3) x (r1,r3) + (f2) x (r2)

  • trying all the solutions and looking for the lowest cost is not an option, because the combinatorics explode:
for each permutation on rows: (n!)
   for each permutation on columns: (p!)
       for each possible covering of the n x p matrix: (time unknown, but large...)
           compute cost of the covering

so I am looking for an approximate solution. I already have a greedy algorithm of sorts that finds a covering of a given matrix (it begins with unit cells, then merges them when the proportion of empty cells in the merged rectangle is below some threshold).
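
To make the cost comparison concrete, here is a small worked computation on the 3x3 example above, with an illustrative a = 1 (the real a is known but not given in the question):

a = 1.0
n, p = 3, 3
needed = {("r1", "f1"), ("r1", "f3"),
          ("r2", "f2"),
          ("r3", "f1"), ("r3", "f3")}
k = len(needed)                               # 5 needed pairs, so g = 5/9

pair_by_pair  = k * (a + 1)                   # 10.0
whole_product = a + n * p                     # 10.0
reordered     = (a + 2 * 2) + (a + 1 * 1)     # (r1,r3)x(f1,f3) + (r2)x(f2) = 7.0

g_min = (1 + a / (n * p)) / (a + 1)           # 5/9: here g equals g_min exactly,
                                              # which is why the first two costs tie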

To put some numbers in mind: my n is somewhere between 1 and 1000, and my p somewhere between 1 and 200. The coverage pattern is really 'blocky', because the records come in classes for which the requested fields are similar. Unfortunately, I can't access the class of a record...

Question 1: Does anyone have an idea, a clever simplification, or a reference to a paper that could be useful? As I have a lot of requests, an algorithm that works well on average is what I am looking for (but I can't afford one that performs very poorly in some extreme case, for example requesting the whole matrix when n and p are large and the request is actually quite sparse).

Question 2: In fact, the problem is even more complicated: the cost is really of the form a + n * (p^b) + c * n' * p', where b is a constant < 1 (once a record is asked for one field, it is not too costly to ask for other fields) and n' * p' = n * p * (1 - g) is the number of cells I do not want to request (because they are invalid, and there is an additional cost in requesting invalid things). I can't even dream of a rapid solution to this problem, but still... any ideas?
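
To make this generalized cost concrete, here is a minimal sketch of the second cost model; the values of a, b and c below are purely illustrative assumptions, not values from the question:

def cost2(n_records, n_fields, n_unwanted, a=10.0, b=0.5, c=2.0):
    """a + n * p**b + c * (number of requested cells that are not actually needed)."""
    return a + n_records * n_fields**b + c * n_unwanted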

+1  A: 

I'm sure there's a really good algorithm for this out there somewhere, but here are my own intuitive ideas:

  1. Toss-some-rectangles approach:

    • Determine a "roughly optimal" rectangle size based on a.
    • Place these rectangles (perhaps randomly) over your required points, until all points are covered.
    • Now take each rectangle and shrink it as much as possible without "losing" any data points (a sketch of this shrink step appears below).
    • Find rectangles close to each other and decide whether combining them would be cheaper than keeping them separate.
  2. Grow

    • Start off with each point in its own 1x1 rectangle.
    • Locate all rectangles within a few rows/columns of each other (a distance that may be based on a); see if you can combine them into one rectangle for no cost (or negative cost :D).
    • Repeat.
  3. Shrink

    • Start off with one big rectangle, that covers ALL points.
    • Look for a sub-rectangle which shares a pair of sides with the big one, but contains very few points.
    • Cut it out of the big one, producing two smaller rectangles.
    • Repeat.
  4. Quad

    • Divide the plane into 4 rectangles. For each of these, see if you get a better cost by recursing further, or by just including the whole rectangle.
    • Now take your rectangles and see if you can merge any of them with little/no cost.

Also: keep in mind that sometimes it will be better to have two overlapping rectangles than one large rectangle which is a superset of them. E.g. the case when two rectangles just overlap in one corner.
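
A minimal sketch of the "shrink" step from idea 1, assuming a rectangle is stored as inclusive ((row_lo, row_hi), (col_lo, col_hi)) bounds and the required points as (row, col) pairs; both representations are assumptions, not from the answer:

def shrink(rect, points):
    """Shrink a rectangle to the bounding box of the required points it contains."""
    (row_lo, row_hi), (col_lo, col_hi) = rect
    inside = [(r, c) for (r, c) in points
              if row_lo <= r <= row_hi and col_lo <= c <= col_hi]
    if not inside:
        return None                                  # covers no points: drop it
    rows = [r for r, _ in inside]
    cols = [c for _, c in inside]
    return ((min(rows), max(rows)), (min(cols), max(cols)))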

Artelius
You're not limited to rectangles.
wrang-wrang
@wrang-wrang: yes I am. @Artelius: yes, it is true that it may be better to have overlapping rectangles than strictly non-overlapping ones. I am currently testing a modified version of your 'Grow' solution. I start with 1x1 rectangles, then I merge the two least costly (sparseness-wise) rectangles and repeat. This gives a linear list of clusterings, from which I take the one with minimum cost. But the real problem is not there; it is in the permutations I can make on the rows and the columns, which is what makes the combinatorics explode (n! * p!, not accounting for symmetry).
LeMiz
Ah, so r1, ..., rn don't have to be consecutive? I think my head will explode.
Artelius
mine blew up a while ago...
LeMiz
A: 

I'd consider the n records (rows) and p fields (cols) mentioned in the user request set, viewing the records as n points in p-dimensional space ({0,1}^p) with the i-th coordinate being 1 iff field i has an X, and identify a hierarchy of clusters, with the coarsest cluster at the root including all the Xs. For each node in the clustering hierarchy, consider the product request that covers all the columns needed, i.e. rows(any subnode) x cols(any subnode). Then decide, from the bottom up, whether to merge the child coverings (paying for the whole covering) or to keep them as separate requests. (The coverings are not of contiguous columns, but of exactly the columns needed; think of a bit vector.)

I agree with Artelius that overlapping product-requests could be cheaper; my hierarchical approach would need improvement to incorporate that.
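
A rough sketch of the bottom-up decision described above, assuming the hierarchy is already given as a binary tree of nested (left, right) tuples whose leaves are record identifiers, and that needed[r] is the set of fields requested for record r; building the tree itself (e.g. by clustering the rows as points in {0,1}^p) is not shown:

def cover(node, needed, cost):
    """Return (requests, total_cost) for the subtree rooted at `node`.

    A request is a (records, fields) pair of sets; `cost` prices a single request.
    """
    if not isinstance(node, tuple):                       # leaf: a single record
        request = ({node}, set(needed[node]))
        return [request], cost(request)

    left_requests, left_cost = cover(node[0], needed, cost)
    right_requests, right_cost = cover(node[1], needed, cost)
    children = left_requests + right_requests

    # candidate: one product request covering every record and field in the subtree
    merged = (set().union(*(records for records, _ in children)),
              set().union(*(fields for _, fields in children)))

    if cost(merged) <= left_cost + right_cost:
        return [merged], cost(merged)
    return children, left_cost + right_cost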

wrang-wrang
A: 

Since your values are sparse, could it be that many users are asking for similar values? Is caching within your application an option? The requests could be indexed by a hash that is a function of (x,y) position, so that you can easily identify cached sets that fall within the correct area of the grid. Storing the cached sets in a tree, for example, would allow you to find minimum cached subsets that cover the request range very quickly. You can then do a linear lookup on the subset, which is small.
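
A minimal sketch of a simpler variant of this idea, with the cache keyed directly by (record, field) pairs; the names and structure are illustrative only, not from the answer:

cache = {}

def split_request(pairs):
    """Split requested (record, field) pairs into cached hits and misses."""
    hits = {pair: cache[pair] for pair in pairs if pair in cache}
    misses = [pair for pair in pairs if pair not in cache]
    return hits, misses

def store(results):
    """Store fetched values, keyed by (record, field)."""
    cache.update(results)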

ire_and_curses
Hello, we already cache the results, of course. The real problem is that we don't really know when to make cached requests expire. So for business-critical purposes, requesting systems have the option of bypassing the cache for certain values (this is in fact one of the causes of the sparseness of the requests).
LeMiz
+1  A: 

Ok, my understanding of the question has changed. New ideas:

  • Store each row as a long bit-string. AND pairs of bit-strings together, trying to find pairs that maximise the number of common 1 bits (a small sketch of this pairing step appears after this list). Grow these pairs into larger groups (sort them and try to match the really big ones with each other). Then construct a request that will hit the largest group, and then forget about all those bits. Repeat until everything is done. Maybe switch from rows to columns sometimes.

  • Look for all rows/columns with zero or few points in them. "Delete" them temporarily. Now look at what would be covered by a request that leaves them out. Then perhaps apply one of the other techniques, and deal with the ignored rows/cols afterwards. Another way of thinking about this: deal with the denser points first, and then move on to the sparser ones.
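
A small sketch of the pairing step from the first bullet, assuming each row is encoded as a Python int whose set bits mark the needed fields; the encoding and the helper name are assumptions, not from the answer:

from itertools import combinations

def best_pair(rows):
    """Return the indices of the two rows whose bit-strings share the most 1 bits."""
    best, best_common = None, -1
    for i, j in combinations(range(len(rows)), 2):
        common = bin(rows[i] & rows[j]).count("1")   # fields needed by both rows
        if common > best_common:
            best, best_common = (i, j), common
    return best, best_common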

Artelius
+5  A: 

Selecting the submatrices to cover the requested values is a form of the set cover problem and hence NP-complete. Your problem adds to this already hard problem the fact that the costs of the sets differ.

That you are allowed to permute the rows and columns is not such a big problem, because you can simply consider disconnected submatrices: row one, columns four to seven, together with row five, columns four to seven, form a valid set, because you can just swap row two and row five and obtain the connected submatrix from row one, column four to row two, column seven. Of course this adds some constraints (not all sets are valid under all permutations), but I don't think this is the biggest problem.

The Wikipedia article gives the inapproximability result that the problem cannot be solved in polynomial time better than within a factor of 0.5 * log2(n), where n is the number of sets. In your case, 2^(n * p) is a (quite pessimistic) upper bound for the number of sets, which yields that you can only find a solution up to a factor of 0.5 * n * p in polynomial time (unless P = NP, and ignoring the varying costs).

An optimistic lower bound for the number of sets, ignoring permutations of rows and columns, is 0.5 * n^2 * p^2, yielding a much better factor of log2(n) + log2(p) - 0.5. In consequence, in your worst case of n = 1000 and p = 200, you can only expect to find a solution up to a factor of about 17 in the optimistic case, and up to a factor of about 100,000 in the pessimistic case (still ignoring the varying costs).
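
For reference, a quick check of those two factors, using only n, p and the set-count bounds stated above:

import math

n, p = 1000, 200
# pessimistic: 2**(n*p) sets, so the factor is 0.5 * log2(2**(n*p)) = 0.5 * n * p
pessimistic_factor = 0.5 * n * p                            # 100000.0
# optimistic: 0.5 * n**2 * p**2 sets, so the factor is 0.5 * log2(0.5 * n**2 * p**2)
optimistic_factor = 0.5 * math.log2(0.5 * n**2 * p**2)      # about 17.1
print(pessimistic_factor, optimistic_factor)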

So the best you can do is to use a heuristic algorithm (the Wikipedia article mentions an almost-optimal greedy algorithm) and accept that there will be cases where the algorithm performs (very) badly. Or you go the other way and use an optimization algorithm, trying to find a good solution by spending more time. In this case I would suggest trying A* search.

Daniel Brückner
Thanks for the response. I am well aware the problem is NP-hard, but I am looking for a solution that works well in practice. Also, after careful study, I think the set cover formulation is not trivial, because 1) the cost function is very particular and 2) so are the constraints. Time to start a bounty!
LeMiz
A: 

I've worked a bit on it, and here is an obvious, O(n^3), greedy, symmetry-breaking algorithm (records and fields are treated separately), in Python.

The idea is trivial: we start with one request per record, and we perform the most worthwhile merge until there is nothing left worth merging. This algorithm has the obvious disadvantage that it does not allow overlapping requests, but I expect it to work quite well on real-life cases (with the a + n * (p^b) + c * n * p * (1 - g) cost function):

# given:
# - a function cost(request) -> positive real, where a request is a pair of sets
#   (records, fields)
# - merging two requests (records1, fields1) and (records2, fields2) yields
#   (records1 U records2, fields1 U fields2)

def merge(request1, request2):
    (records1, fields1), (records2, fields2) = request1, request2
    return (records1 | records2, fields1 | fields2)

def greedy_cover(needed_pairs, cost):
    # initialize with one request per record, containing exactly its needed fields
    fields_by_record = {}
    for record, field in needed_pairs:
        fields_by_record.setdefault(record, set()).add(field)
    requests = [(frozenset({record}), frozenset(fields))
                for record, fields in fields_by_record.items()]

    finished = False
    while not finished:  # there might be something to gain
        maximum_gain = 0
        finished = True
        this_step_merge = None

        # loop over all distinct pairs of requests
        for i, request1 in enumerate(requests):
            for request2 in requests[i + 1:]:
                merged_request = merge(request1, request2)
                gain = cost(request1) + cost(request2) - cost(merged_request)

                if gain > maximum_gain:
                    maximum_gain = gain
                    this_step_merge = (request1, request2, merged_request)

        # if we found at least one worthwhile merge, apply it and keep going
        if maximum_gain > 0:
            # ... so update the list of requests ...
            request1, request2, merged_request = this_step_merge
            requests.remove(request1)
            requests.remove(request2)
            # ... and we are not done yet
            requests.append(merged_request)
            finished = False

    return requests
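
For example, a quick way to exercise the function above on the question's 3x3 example, with a cost function for the simple a + n * p model (illustrative a = 1):

def simple_cost(request, a=1.0):
    records, fields = request
    return a + len(records) * len(fields)

needed = [("r1", "f1"), ("r1", "f3"), ("r2", "f2"), ("r3", "f1"), ("r3", "f3")]
print(greedy_cover(needed, simple_cost))
# two requests: ({'r2'}, {'f2'}) and ({'r1', 'r3'}, {'f1', 'f3'}), total cost 7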

This is O(n^3 * p) because:

  • after initialization we start with n requests
  • the while loop removes exactly one request from the pool at each iteration.
  • the inner for loop iterates over the (n_i^2 - n_i) / 2 distinct pairs of requests, with n_i going from n down to one in the worst case (when we merge everything into one big request).
  • each merge and cost evaluation manipulates field sets of size up to p, hence the extra factor of p.

    1. Can someone help me pinpoint the really bad cases for this algorithm? Does it sound reasonable to use it?
    2. It is O(n^3), which is way too costly for large inputs. Any ideas for optimizing it?

Thanks in advance !

LeMiz