tags:
views: 136
answers: 1

Hi everyone,

I have toyed with a number of ideas to do this, but so far have only come up with some rather inelegant solutions. I'm sure I could make it work, but the code would neither be pretty nor efficient. Here's the problem:

I have a series of integer pairs that are presented as rows in a two-column data frame. The goal is three-fold:

  1. You need to "eliminate" all the rows in this data frame. To "eliminate" a row, you must select either unit of that pair and add it to a vector of "selected" elements.

  2. You must find the smallest possible combination of "selected elements" that will eliminate all the pairs in the data frame.

  3. The code must be computationally efficient because it will be applied to rather large datasets.

For instance, one would choose items "1" and "2" from the following list of pairs:

1   3
1   4
2   5
3   2
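
A quick check of that choice (a throwaway R snippet added purely for illustration, not part of the original question): with 1 and 2 selected, every pair contains at least one selected element.

# Toy example: selecting 1 and 2 leaves no pair without a selected element
toy <- data.frame(row = c(1, 1, 2, 3), col = c(3, 4, 5, 2))
selected <- c(1, 2)
all(toy$row %in% selected | toy$col %in% selected)  # TRUE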

The data below can be used as a working example.

Thanks!

Vincent


Update for some context:

Hi Cipi and SiggyF.

I understand your concerns about this being homework, so in case you read this again, here's some context that I hope may dispel your doubts.

I am working with time-series cross-sectional data in which N is much larger than T. I would like to use panel-corrected standard errors like those proposed in Beck & Katz (1995). The package "pcse" is mostly able to do this just fine. When you have an unbalanced panel, it essentially creates a "rectangular" dataset (every unit has an observation for every time period) by filling in missing values for the omitted observations in every panel. Then, pcse computes a matrix Sigma.hat, which is essentially the weighted average of the outer product of the residuals within time periods (think of it as averaging over an N x N x T array to bring it down to an N x N Sigma.hat).

The problem is that if any two units have zero contemporaneous observations, then the corresponding cell in Sigma.hat will be NA, and pcse won't be able to use it to get the sandwich estimator of the variance-covariance matrix. In my example, the data frame numbers correspond to the indices of the missing values in Sigma.hat. I want to trim down Sigma.hat automatically, to get an estimate of the VCOV that uses the most information possible, hence my desire to keep as many of the numbers in the data frame as possible.
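
To illustrate where the NAs come from, here is a rough sketch of the pairwise idea (my own toy version, not pcse's actual code; pcse applies weights that this sketch omits):

# E is a T x N residual matrix with NA where a unit is missing in a period.
# Each cell of the sketch averages e[t, i] * e[t, j] over the periods in which
# both units are observed; a pair of units with no shared period gets an NA cell.
pairwise_sigma <- function(E) {
  N <- ncol(E)
  S <- matrix(NA_real_, N, N)
  for (i in seq_len(N)) {
    for (j in seq_len(N)) {
      both <- !is.na(E[, i]) & !is.na(E[, j])
      if (any(both)) S[i, j] <- mean(E[both, i] * E[both, j])
    }
  }
  S
}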

This is probably very unclear to anyone who hasn't looked into pcse, but I hope you get the gist of it.

Sorry to have given an impression of impropriety, but I assure you, this is legit.


test<-structure(list(row = c(27L, 44L, 45L, 111L, 128L, 129L, 195L, 212L, 213L, 279L, 296L, 297L, 363L, 380L, 381L, 7L, 91L, 175L, 259L, 343L, 44L, 45L, 70L, 128L, 129L, 154L, 212L, 213L, 238L, 296L, 297L, 322L, 380L, 381L, 406L, 7L, 37L, 48L, 91L, 121L, 132L, 175L, 205L, 216L, 259L, 289L, 300L, 343L, 373L, 384L, 7L, 37L, 48L, 91L, 121L, 132L, 175L, 205L, 216L, 259L, 289L, 300L, 343L, 373L, 384L, 44L, 45L, 128L, 129L, 212L, 213L, 296L, 297L, 380L, 381L, 37L, 121L, 205L, 289L, 373L, 27L, 44L, 45L, 111L, 128L, 129L, 195L, 212L, 213L, 279L, 296L, 297L, 363L, 380L, 381L, 7L, 91L, 175L, 259L, 343L, 44L, 45L, 70L, 128L, 129L, 154L, 212L, 213L, 238L, 296L, 297L, 322L, 380L, 381L, 406L, 7L, 37L, 48L, 91L, 121L, 132L, 175L, 205L, 216L, 259L, 289L, 300L, 343L, 373L, 384L, 7L, 37L, 48L, 91L, 121L, 132L, 175L, 205L, 216L, 259L, 289L, 300L, 343L, 373L, 384L, 44L, 45L, 128L, 129L, 212L, 213L, 296L, 297L, 380L, 381L, 37L, 121L, 205L, 289L, 373L, 27L, 44L, 45L, 111L, 128L, 129L, 195L, 212L, 213L, 279L, 296L, 297L, 363L, 380L, 381L, 7L, 91L, 175L, 259L, 343L, 44L, 45L, 70L, 128L, 129L, 154L, 212L, 213L, 238L, 296L, 297L, 322L, 380L, 381L, 406L, 7L, 37L, 48L, 91L, 121L, 132L, 175L, 205L, 216L, 259L, 289L, 300L, 343L, 373L, 384L, 7L, 37L, 48L, 91L, 121L, 132L, 175L, 205L, 216L, 259L, 289L, 300L, 343L, 373L, 384L, 44L, 45L, 128L, 129L, 212L, 213L, 296L, 297L, 380L, 381L, 37L, 121L, 205L, 289L, 373L, 27L, 44L, 45L, 111L, 128L, 129L, 195L, 212L, 213L, 279L, 296L, 297L, 363L, 380L, 381L, 7L, 91L, 175L, 259L, 343L, 44L, 45L, 70L, 128L, 129L, 154L, 212L, 213L, 238L, 296L, 297L, 322L, 380L, 381L, 406L, 7L, 37L, 48L, 91L, 121L, 132L, 175L, 205L, 216L, 259L, 289L, 300L, 343L, 373L, 384L, 7L, 37L, 48L, 91L, 121L, 132L, 175L, 205L, 216L, 259L, 289L, 300L, 343L, 373L, 384L, 44L, 45L, 128L, 129L, 212L, 213L, 296L, 297L, 380L, 381L, 37L, 121L, 205L, 289L, 373L, 27L, 44L, 45L, 111L, 128L, 129L, 195L, 212L, 213L, 279L, 296L, 297L, 363L, 380L, 381L, 7L, 91L, 175L, 259L, 343L, 44L, 45L, 70L, 128L, 129L, 154L, 212L, 213L, 238L, 296L, 297L, 322L, 380L, 381L, 406L, 7L, 37L, 48L, 91L, 121L, 132L, 175L, 205L, 216L, 259L, 289L, 300L, 343L, 373L, 384L, 7L, 37L, 48L, 91L, 121L, 132L, 175L, 205L, 216L, 259L, 289L, 300L, 343L, 373L, 384L, 44L, 45L, 128L, 129L, 212L, 213L, 296L, 297L, 380L, 381L, 37L, 121L, 205L, 289L, 373L), col = c(7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 27L, 27L, 27L, 27L, 27L, 37L, 37L, 37L, 37L, 37L, 37L, 37L, 37L, 37L, 37L, 37L, 37L, 37L, 37L, 37L, 44L, 44L, 44L, 44L, 44L, 44L, 44L, 44L, 44L, 44L, 44L, 44L, 44L, 44L, 44L, 45L, 45L, 45L, 45L, 45L, 45L, 45L, 45L, 45L, 45L, 45L, 45L, 45L, 45L, 45L, 48L, 48L, 48L, 48L, 48L, 48L, 48L, 48L, 48L, 48L, 70L, 70L, 70L, 70L, 70L, 91L, 91L, 91L, 91L, 91L, 91L, 91L, 91L, 91L, 91L, 91L, 91L, 91L, 91L, 91L, 111L, 111L, 111L, 111L, 111L, 121L, 121L, 121L, 121L, 121L, 121L, 121L, 121L, 121L, 121L, 121L, 121L, 121L, 121L, 121L, 128L, 128L, 128L, 128L, 128L, 128L, 128L, 128L, 128L, 128L, 128L, 128L, 128L, 128L, 128L, 129L, 129L, 129L, 129L, 129L, 129L, 129L, 129L, 129L, 129L, 129L, 129L, 129L, 129L, 129L, 132L, 132L, 132L, 132L, 132L, 132L, 132L, 132L, 132L, 132L, 154L, 154L, 154L, 154L, 154L, 175L, 175L, 175L, 175L, 175L, 175L, 175L, 175L, 175L, 175L, 175L, 175L, 175L, 175L, 175L, 195L, 195L, 195L, 195L, 195L, 205L, 205L, 205L, 205L, 205L, 205L, 205L, 205L, 205L, 205L, 205L, 205L, 205L, 205L, 205L, 212L, 212L, 212L, 212L, 212L, 212L, 212L, 212L, 212L, 212L, 212L, 212L, 212L, 212L, 212L, 213L, 213L, 213L, 213L, 213L, 213L, 213L, 213L, 213L, 213L, 213L, 213L, 
213L, 213L, 213L, 216L, 216L, 216L, 216L, 216L, 216L, 216L, 216L, 216L, 216L, 238L, 238L, 238L, 238L, 238L, 259L, 259L, 259L, 259L, 259L, 259L, 259L, 259L, 259L, 259L, 259L, 259L, 259L, 259L, 259L, 279L, 279L, 279L, 279L, 279L, 289L, 289L, 289L, 289L, 289L, 289L, 289L, 289L, 289L, 289L, 289L, 289L, 289L, 289L, 289L, 296L, 296L, 296L, 296L, 296L, 296L, 296L, 296L, 296L, 296L, 296L, 296L, 296L, 296L, 296L, 297L, 297L, 297L, 297L, 297L, 297L, 297L, 297L, 297L, 297L, 297L, 297L, 297L, 297L, 297L, 300L, 300L, 300L, 300L, 300L, 300L, 300L, 300L, 300L, 300L, 322L, 322L, 322L, 322L, 322L, 343L, 343L, 343L, 343L, 343L, 343L, 343L, 343L, 343L, 343L, 343L, 343L, 343L, 343L, 343L, 363L, 363L, 363L, 363L, 363L, 373L, 373L, 373L, 373L, 373L, 373L, 373L, 373L, 373L, 373L, 373L, 373L, 373L, 373L, 373L, 380L, 380L, 380L, 380L, 380L, 380L, 380L, 380L, 380L, 380L, 380L, 380L, 380L, 380L, 380L, 381L, 381L, 381L, 381L, 381L, 381L, 381L, 381L, 381L, 381L, 381L, 381L, 381L, 381L, 381L, 384L, 384L, 384L, 384L, 384L, 384L, 384L, 384L, 384L, 384L, 406L, 406L, 406L, 406L, 406L)), .Names = c("row", "col" ), row.names = c(NA, -400L), class = "data.frame")

+3  A: 

OK: if you consider your elements as vertices and your pairs as edges of a graph, your problem becomes an instance of the well-known (and NP-complete) vertex cover problem. You can easily find an approximate solution, guaranteed to be within a factor of two of optimal, by choosing an arbitrary edge, selecting both of its vertices, removing all eliminated edges, then lather, rinse, repeat. You can do incrementally better with more complicated approximation algorithms, but finding the optimal solution for a large graph is probably not feasible.

Here is a simple function to do this. (Note that R is not my native language, so this is probably hideously non-idiomatic; any suggestions for improvement would be appreciated.)

good <- function(dat, result = NULL) {
  # Base case: no pairs left to eliminate
  if (nrow(dat) == 0) {
    return(result)
  }
  # Pick a random remaining pair and select both of its elements
  sampr <- dat[sample(nrow(dat), 1), ]
  # Drop every pair touched by either selected element, then recurse
  good(subset(dat, row != sampr$row & row != sampr$col &
                   col != sampr$row & col != sampr$col),
       result = c(result, sampr$row, sampr$col))
}

I'd run this a number of times and keep the best one. (It might also be useful to keep track of the size of the worst one, as half of it gives you a lower bound on the optimal size.) It would also be worth postprocessing the result to remove redundant vertices.
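
Something along these lines would do the repetition, pruning, and lower-bound bookkeeping (a rough sketch only; the helper names and the one-pass pruning rule are mine):

# Remove redundant vertices: drop any selected element whose removal still
# leaves every pair covered.
prune_cover <- function(dat, cover) {
  for (v in cover) {
    rest <- setdiff(cover, v)
    if (all(dat$row %in% rest | dat$col %in% rest)) cover <- rest
  }
  cover
}

# Repeat the randomized search, keep the smallest pruned cover, and track the
# lower bound: the chosen pairs form a matching, so any cover needs at least
# half as many vertices as the largest raw result seen.
run_cover <- function(dat, iterations = 10000) {
  best <- NULL
  lower_bound <- 0
  for (i in seq_len(iterations)) {
    raw <- good(dat)
    lower_bound <- max(lower_bound, length(raw) / 2)
    pruned <- prune_cover(dat, raw)
    if (is.null(best) || length(pruned) < length(best)) best <- pruned
  }
  list(cover = sort(best), lower_bound = lower_bound)
}

Calling run_cover(test) then returns both the best cover found and the bound.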

Running 10000 iterations (and removing redundant vertices) gives the following 19-element solution to your sample problem.

7 37 45 48 91 121 128 132 175 205 212 216 259 279 289 300 343 373 384

We also know that the optimal solution must have at least 15 vertices.

deinst
Thanks for the info about vertex cover. I would not have thought of it this way. Knowing that I probably won't be able to reliably find a perfectly optimal answer in large datasets is very useful. Your solution is very close to what I was going to implement. However, I was thinking about doing a "guided" search instead of a purely random one. For instance, the loop could start with numbers that appear most often in the data frame. That would probably save me some computation time and would likely get me an OK answer. Thanks a lot for your input deinst. I really appreciate it.
Vincent
No problem. There are more complicated approximation methods using linear programming ideas.
deinst
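For the curious, one standard LP-based route can be sketched as follows (this assumes the lpSolve package is available; keeping every variable whose relaxed value is at least 0.5 preserves the factor-two guarantee):

library(lpSolve)

# Minimise the number of selected elements subject to x[u] + x[v] >= 1 for
# every pair, then round: keep every element whose LP value is >= 0.5.
lp_cover <- function(dat) {
  verts <- sort(unique(c(dat$row, dat$col)))
  ne <- nrow(dat)
  A <- matrix(0, nrow = ne, ncol = length(verts))
  A[cbind(seq_len(ne), match(dat$row, verts))] <- 1
  A[cbind(seq_len(ne), match(dat$col, verts))] <- 1
  sol <- lp("min", objective.in = rep(1, length(verts)),
            const.mat = A, const.dir = rep(">=", ne), const.rhs = rep(1, ne))
  verts[sol$solution >= 0.5]
}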
Cool! Taking the simpler approach of eliminating the elements that appear most frequently in this data frame, I get a 20-element solution: 7 37 44 45 91 121 128 129 175 205 212 213 259 289 296 297 343 373 380 381. Pretty close for such a simple rule, although I do realize the difference might be larger in bigger samples (but the computation time for so many iterations would also be much larger...).
Vincent
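
For reference, that frequency-based rule can be written as a short greedy loop (my own transcription of the idea, not Vincent's actual code; ties are broken arbitrarily, so the exact 20-element answer may vary):

# Greedy heuristic: repeatedly select the element that appears in the most
# remaining pairs, drop the pairs it eliminates, and stop when none remain.
greedy_cover <- function(dat) {
  cover <- integer(0)
  while (nrow(dat) > 0) {
    counts <- table(c(dat$row, dat$col))
    v <- as.integer(names(which.max(counts)))
    cover <- c(cover, v)
    dat <- dat[dat$row != v & dat$col != v, ]
  }
  sort(cover)
}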