views:

89

answers:

2

Dear Coding Experts

I have a project in which i need to be able to calculate different voting power indexes in R. As a first attempt at this I wrote a small function to calculate the banzhaf index. It takes two arguments, a dataframe that has two columns which must be labelled member and vote, and how many votes are needed for a majority (quota):

library(combinat)
banzhaf <- function(data,quota){
 f <- vector()
 m <- vector()
 score <- vector()
 name <- vector()
 pivot <- vector()
 for (n in 1:nrow(data)){
  y <- as.matrix(combn(data$member,n))
  for (i in 1:ncol(y)){
   for ( j in 1:n){
    f[j] <- data[data$member == y[j,i],]$vote
    m[j] <- as.character(data[data$member == y[j,i],]$member)
    o <- data.frame(member = m, vote = f)
    }

   if (sum(o$vote) >= quota){
    for (k in 1:length(o$member)){
     t <- o[-k,]
    if (sum(t$vote) < quota){
     pivot[length(pivot) + 1] <- as.character(o$member[k])
     }
    }
   }
  }
 }

 for (l in unique(pivot)){
  score[length(score) + 1] <- sum(pivot == l)
  name[length(name) + 1] <- l
  }
 out <- data.frame(name = name, score = score/length(pivot))
 return(out)
}

The problem with this function is that it becomes incredibly slow when i have more than 8 members in the dataframe. This is due to the combn() function used in the outermost loop (I think). Does anyone know how this can be made to run faster?

Best, Thomas

P.S: If you want to test it use the following data, but beware that it might run forever!

x <- c("Germany","France","UK","Italy","Spain","Poland","Romania","Netherlands","Greece","Portugal","Belgium","Czech Rep.","Hungary","Sweden","Austria","Bulgaria","Denmark","Slovakia","Finland","Ireland","Lithuania","Latvia","Slovenia","Estonia","Cyprus","Luxembourg","Malta")
z <- c(29,29,29,29,27,27,14,13,12,12,12,12,12,10,10,10,7,7,7,7,7,4,4,4,4,4,3)

dat <- data.frame(member = as.character(x),vote = z)

oi <- banzhaf(dat, 255)
oi
+1  A: 

Your example data frame has 27 rows and you're looking at every set (except the null set) so that's 2^27 - 1 = 134 217 727 operations at least... this is going to take some time. That said, here's what I believe to be a more efficient version of your code. It seems to match the Wikipedia article at least: http://en.wikipedia.org/wiki/Banzhaf_power_index

banzhaf1 <- function(data, quota) {
  n <- nrow(data)
  vote <- data$vote
  swingsPerIndex <- numeric(n)
  for (setSize in 1:n) {
    sets <- utils::combn(n, setSize)
    numSets <- ncol(sets)
    flatSets <- as.vector(sets)
    voteMatrix <- matrix(vote[flatSets], nrow=setSize, ncol=numSets)
    totals <- colSums(voteMatrix)
    aboveQuota <- totals >= quota
    totalsMatrix <- matrix(rep(totals, each=setSize), nrow=setSize, ncol=numSets)
    winDiffs <- totalsMatrix[, aboveQuota] - voteMatrix[, aboveQuota]
    winSets <- sets[, aboveQuota]
    swingers <- as.vector(winSets[winDiffs < quota])
    swingsPerIndex <- swingsPerIndex + tabulate(swingers, n)
  }
  return(data.frame(name=data$member, score=swingsPerIndex / sum(swingsPerIndex)))
}

(I haven't tried running this on the full data set.)

I think to really approach this problem efficiently, you'll have to take advantage of the structure of the problem. For instance, once you know set X has vote sum above quota, then you know that X union Y is also above quota. I'm not sure if R will be well-suited to following such structure.

David F
+1  A: 

My approach was similar to David's, using batched matrix operations to handle the size:

banzhaf = function(votes, pass=sum(votes) %/% 2 + 1, batch.size=500000, quiet=batches == 1) {
  n = length(votes)
  batches = ceiling((2^n / batch.size))
  if (!quiet)
    cat('calculating...\n')
  Reduce(`+`, lapply(1:batches, function(b) {
    if (!quiet)
      cat('-', b, '/', batches, '\n')
    i = ((b - 1) * batch.size + 1):min(2^n, b * batch.size)
    m = do.call(cbind, lapply(as.integer(2^((1:n) - 1L)), function(j, k) (k %/% j) %% 2L, i))
    x = drop(m %*% votes)
    passed = x >= pass
    colSums((outer(x[passed] - pass, votes, `<`) * m[passed, , drop=F]))
  }))
}

Uses R's name propagation instead of a data.frame, avoid loops where possible, and use integers instead of numerics if possible. Still took over 6 minutes to run on my box:

# wikipedia examples
banzhaf(c(A=4, B=3, C=2, D=1), 6)
banzhaf(c('Hempstead #1'=9, 'Hempstead #2'=9, 'North Hempstead'=7, 'Oyster Bay'=3, 'Glen Cove'=1, 'Long Beach'=1), 16)

# stackoverflow data
system.time(banzhaf(setNames(as.integer(z), x), 255))

The thinking went something like:

  • 2^n possible outcomes (2 outcomes per player, n independent players)
  • represented by the numbers the 1:2^n (cf 'i')
  • expressing the number in binary gives each player's vote.
  • using modulus and division to extract the bits into a voting matrix (cf 'm'), in lieu of bitwise ops (only added to R recently I believe).

After that I think it plays out in the same manner as David's. The only complication was ensuring use of integers for efficiency, and adding the batching as its not really feasible to create a matrix of 27:2^27!

Charles
Could you say more about what this code is doing, particularly this bit?
David F
this bit: lapply(as.integer(2^((1:n) - 1L)), function(j, k) (k %/% j) %% 2L, i)
David F
(answer moved to the body text for formatting)
Charles
Cool. Good to have this method in mind for extracting bits. Thanks!
David F