views:

70

answers:

3

Given a desired number of partitions, the partitions should be nearly equal in size. This question handles the problem for a list. Its answers do not have the random property, but that is easy to add. My problem is that I have an iterator as input, so shuffle does not apply. The reason is that I want to randomly partition the nodes of a graph. The graph can be very large, so I am looking for a solution that does not just build an intermediate list.

My first idea was to use compress() with a random number generator as the selector, but that only works for two partitions.

+1  A: 

You're just dealing to various partitions, right?

import random

def dealer( iterator, size ):
    # Tag each item with a uniformly random partition number.
    for item in iterator:
        yield random.randrange( size ), item

Won't that get you started by assigning each item to a partition?

Then you can do something like this to make lists. Maybe not a good thing, but it shows how to use the function.

def make_lists( iterator, size ):
    # One fresh empty list per partition.
    the_lists = [ [] for _ in range(size) ]
    for partition, item in dealer( iterator, size ):
        the_lists[partition].append(item)
    return the_lists
S.Lott
You could shorten the second part by using groupby from itertools (a sketch follows below).
wheaties
@wheaties: while true, it's not clear what the partitions are being used for.
S.Lott
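For concreteness, the groupby variant might look something like this (a sketch building on dealer() above, not code from the thread). Note that groupby needs its input sorted by key, so it materializes and sorts all the (partition, item) pairs, giving up the streaming property, and partitions that happen to receive no items are simply absent from the result:

from itertools import groupby
from operator import itemgetter

def make_lists_groupby( iterator, size ):
    # Sorting materializes every (partition, item) pair in memory.
    dealt = sorted(dealer(iterator, size), key=itemgetter(0))
    # Consecutive pairs with the same partition number form one group.
    return [ [item for _, item in group]
             for _, group in groupby(dealt, key=itemgetter(0)) ]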
+1  A: 

You could just create k lists. When you receive a value, pick a random integer x between 0 and k-1, and put the value into the x-th list.

On average each list will contain N/k elements, but with a standard deviation of √(N * 1/k * (1-1/k)).

import random

def random_partition(k, iterable):
    # One list per partition; each value goes to a uniformly random one.
    results = [[] for _ in range(k)]
    for value in iterable:
        x = random.randrange(k)
        results[x].append(value)
    return results
KennyTM
Can you give a source or explanation for that standard deviation?
Space_C0wb0y
@Space: This is just a binomial distribution.
KennyTM
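To spell that out: each of the N items lands in a given list independently with probability p = 1/k, so each list's size follows a Binomial(N, p) distribution, whose standard deviation is √(N * p * (1-p)) = √(N * 1/k * (1-1/k)).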
A: 

You can make the lengths of the lists more uniform by adjusting the weights based on how many nodes have been dealt to each partition so far. They'll be roughly equal in length if you pick a weight function that drops to 0 when (number of nodes in partition i) ≥ (number of nodes)/(number of partitions), i.e.

weight[i] = max(numNodes / numPartitions - nodesSoFar[i], 0)

(The max() is to stop negative weights, which might happen if you have 4 nodes and 3 partitions.)

Then pick a random number from 1 to sum(weights) (or 0 to sum(weights) - 1) and choose the partition accordingly.
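A minimal sketch of that scheme (assuming the total node count is known up front, which the weight formula requires; the names are mine, not from the answer):

import random

def weighted_partition(iterable, num_nodes, num_partitions):
    # Target size for every partition; requires knowing the total count.
    target = num_nodes / num_partitions
    results = [[] for _ in range(num_partitions)]
    for item in iterable:
        # A partition's weight drops to 0 once it reaches its target,
        # so it can no longer be chosen.
        weights = [max(target - len(p), 0) for p in results]
        i = random.choices(range(num_partitions), weights=weights)[0]
        results[i].append(item)
    return results

(random.choices is Python 3.6+; it handles the "pick a random number up to sum(weights)" step for you.)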

compress() works provided you use a different selector per partition; something like (x == n for x in random_partition_numbers), where random_partition_numbers is a generator. You'll need to copy random_partition_numbers for each partition, of course. This design is inherently slower, since it iterates over the node stream once per partition (see the sketch below).

tc.
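One possible rendering of that compress() design (my sketch, not tc.'s code; itertools.tee copies both streams, but tee buffers everything if you exhaust one partition before touching the others, so it only avoids the intermediate list when the partitions are consumed roughly in parallel):

import random
from itertools import compress, count, tee

def compress_partition(iterable, k):
    # k copies of the node stream, and k identical copies of the
    # random partition-number stream.
    node_streams = tee(iterable, k)
    number_streams = tee((random.randrange(k) for _ in count()), k)

    def select(nodes, numbers, n):
        # Partition n keeps exactly the items whose random number is n.
        return compress(nodes, (x == n for x in numbers))

    return [select(nodes, numbers, n)
            for n, (nodes, numbers) in enumerate(zip(node_streams, number_streams))]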