views:

210

answers:

5

I have a matrix that I want to randomize a couple of thousand times, while keeping the row and column totals the same:

     1 2 3 
   A 0 0 1 
   B 1 1 0 
   C 1 0 0      

An example of a valid random matrix would be:

     1 2 3
   A 1 0 0
   B 1 1 0
   C 0 0 1

My actual matrix is a lot bigger (about 600x600 items), so I really need an approach that is computationally efficient.

My initial (inefficient) approach consisted of shuffling arrays using the Perl Cookbook shuffle

I pasted my current code below. I've got extra code in place to start with a new shuffled list of numbers, if no solution is found in the while loop. The algorithm works fine for a small matrix, but as soon as I start scaling up it takes forever to find a random matrix that fits the requirements.

Is there a more efficient way to accomplish what I'm searching for? Thanks a lot!

#!/usr/bin/perl -w
use strict;

my %matrix = ( 'A' => {'3'  => 1 },
           'B' => {'1'  => 1,
               '2'  => 1 },
           'C' => {'1'  => 1 }
    );

my @letters = ();
my @numbers = ();

foreach my $letter (keys %matrix){
    foreach my $number (keys %{$matrix{$letter}}){
    push (@letters, $letter);
    push (@numbers, $number);
    }
}

my %random_matrix = ();

&shuffle(\@numbers);
foreach my $letter (@letters){
    while (exists($random_matrix{$letter}{$numbers[0]})){
    &shuffle (\@numbers);
    }
    my $chosen_number = shift (@numbers);
    $random_matrix{$letter}{$chosen_number} = 1;
}

sub shuffle {
    my $array = shift;
    my $i = scalar(@$array);
    my $j;
    foreach my $item (@$array )
    {
        --$i;
        $j = int rand ($i+1);
        next if $i == $j;
        @$array [$i,$j] = @$array[$j,$i];
    }
    return @$array;
}
A: 

Not sure if it will help, but you can try going from one corner and for each column and row you should track the total and actual sum. Instead of trying to hit a good matrix, try to see the total as amount and split it. For each element, find the smaller number of row total - actual row total and column total - actual column total. Now you have the upper bound for your random number. Is it clear? Sorry I don't know Perl, so I cannot show any code.

Gabriel Ščerbák
Your answer isn't very clear, but I think I understand it now. I think you have missed one of the requirements: each cell can only contain 0 or 1. It's not explicit in the question, but it was mentioned in a comment to the question.
Mark Byers
+3  A: 

Step 1: First I would initialize the matrix to zeros and calculate the required row and column totals.

Step 2: Now pick a random row, weighted by the count of 1s that must be in that row (so a row with count 300 is more likely to be picked than a row with weight 5).

Step 3: For this row, pick a random column, weighted by the count of 1s in that column (except ignore any cells that may already contain a 1 - more on this later).

Step 4: Place a one in this cell and reduce both the row and column count for the appropriate row and column.

Step 5: Go back to step 2 until no rows have non-zero count.

The problem though is that this algorithm can fail to terminate because you may have a row where you need to place a one, and a column that needs a one, but you've already placed a one in that cell, so you get 'stuck'. I'm not sure how likely this is to happen, but I wouldn't be surprised if it happened very frequently - enough to make the algorithm unusable. If this is a problem I can think of two ways to fix it:

a) Construct the above algorithm recursively and allow backtracking on failure.

b) Allow a cell to contain a value greater than 1 if there is no other option and keep going. Then at the end you have a correct row and column count but some cells may contain numbers greater than 1. You can fix this by finding a grouping that looks like this:

2 . . . . 0
. . . . . .
. . . . . .
0 . . . . 1

and changing it to:

1 . . . . 1
. . . . . .
. . . . . .
1 . . . . 0

It should be easy to find such a grouping if you have many zeros. I think b) is likely to be faster.

I'm not sure it's the best way, but it's probably faster than shuffling arrays. I'll be tracking this question to see what other people come up with.

Mark Byers
Thanks for this suggestion! It looks like a sound approach which I will definitely try to implement. I will try to see how many times the algorithm has to fill a cell that is already filled.I don't expect that it will happen often, due to the low frequency of 1's in my table, so simply running the algorithm again might also be feasible (if it doesn't happen too often)!
Lucas
I implemented the algorithm as per your suggestions, but I ran in to the problem that the 1s are not randomly distributed over the matrix. Some columns can contain as many as 45-65 of 1's, while the average is only something like 1.5. Thus, when trying to go for option b) I do not only run into 2's, but also into 3, 4, 5, 6, 7's which are not so easily fixed. I'll try to figure a way to build in the backtracking functionality into the algorithm, but I'm afraid that that will require better programming skills than mine ;).
Lucas
Lucas: Did you remember to weight the random selection of columns? You have to pick the columns with the most 1s first, otherwise you increase the chance of blocking yourself. Could you post an example input data somewhere? SO is unsuitable, but perhaps another website, and link to it?
Mark Byers
Also a 7 isn't really a problem if you use algorithm b). You just need to repeat the reorganisation enough times, each time choosing a different block.
Mark Byers
Also remember **not** to include columns already containing a 1 when selecting a random column.
Mark Byers
(unless you have no other choice)
Mark Byers
I didn't weight the random selection, so that's what I'm going to implement now. I'll post a sample matrix on my website tonight, I'll post the link here when it's there.
Lucas
A: 

Hi

Like @Gabriel I'm not a Perl programmer so it's possible that this is what your code already does ...

You've only posted one example. It's not clear whether you want a random matrix which has the same number of 1s in each row and column as your start matrix, or one which has the same rows and columns but shuffled. If the latter is good enough you could create an array of row (or column, it doesn't matter) indexes and randomly permute that. You can then read your original array in the order specified by the randomised index. No need to modify the original array or create a copy.

Of course, this might not meet aspects of your requirements which are not explicit.

Regards

Mark

High Performance Mark
Thanks Mark! But as you've maybe already read in the other comments, I'm looking for a solution for the former problem you've described :).
Lucas
+1  A: 

I'm not a mathematician, but I figure that if you need to keep the same column and row totals, then random versions of the matrix will have the same quantity of ones and zeros.

Correct me if I'm wrong, but that would mean that making subsequent versions of the matrix would only require you to shuffle around the rows and columns.

Randomly shuffling columns won't change your totals for rows and columns, and randomly shuffling rows won't either. So, what I would do, is first shuffle rows, and then shuffle columns.

That should be pretty fast.

Tim Rupe
My understanding of the problem is that the row (and column) totals must stay constant and in the same order. If so, it would be OK to swap two rows that have the same row total, but not if they have different. In the 3x3 example given in the question, it is valid to swap rows A and C because they both have row total 1, but not rows A and B, because row B has a total of 2.
Mark Byers
Randomly shuffling columns does not affect row totals, but it does affect column totals (and the same holds true vice versa).If I move all columns one place to the right in my initial matrix, the total for column 1 is no longer equal to 2..edit: @Mark You're right + your example is phrased better than mine :)
Lucas
Also, some of the possible solutions that the poster might want to generate wouldn't be reachable just by making row and column swaps. E.g. if you had a 4x4 matrix with row and column totals all 2, and initial configuration `1100,1100,0011,0011` I'm not sure how you could change into `1100,1010,0101,0011` just with row and column swaps.
Mark Byers
Yes, I think the description of the problem is a bit confusing. I had the impression that individual column and row totals needed to stay the same, not the position of the totals. Upon another look, the random shuffle of columns and rows wouldn't work.
Tim Rupe
+7  A: 

The problem with your current algorithm is that you are trying to shuffle your way out of dead ends -- specifically, when your @letters and @numbers arrays (after the initial shuffle of @numbers) yield the same cell more than once. That approach works when the matrix is small, because it doesn't take too many tries to find a viable re-shuffle. However, it's a killer when the lists are big. Even if you could hunt for alternatives more efficiently -- for example, trying permutations rather than random shuffling -- the approach is probably doomed.

Rather than shuffling entire lists, you might tackle the problem by making small modifications to an existing matrix.

For example, let's start with your example matrix (call it M1). Randomly pick one cell to change (say, A1). At this point the matrix is in an illegal state. Our goal will be to fix it in the minimum number of edits -- specifically 3 more edits. You implement these 3 additional edits by "walking" around the matrix, with each repair of a row or column yielding another problem to be solved, until you have walked full circle (err ... full rectangle).

For example, after changing A1 from 0 to 1, there are 3 ways to walk for the next repair: A3, B1, and C1. Let's decide that the 1st edit should fix rows. So we pick A3. On the second edit, we will fix the column, so we have choices: B3 or C3 (say, C3). The final repair offers only one choice (C1), because we need to return to the column of our original edit. The end result is a new, valid matrix.

    Orig         Change A1     Change A3     Change C3     Change C1
    M1                                                     M2

    1 2 3        1 2 3         1 2 3         1 2 3         1 2 3
    -----        -----         -----         -----         -----
A | 0 0 1        1 0 1         1 0 0         1 0 0         1 0 0
B | 1 1 0        1 1 0         1 1 0         1 1 0         1 1 0
C | 1 0 0        1 0 0         1 0 0         1 0 1         0 0 1

If an editing path leads to a dead end, you backtrack. If all of the repair paths fail, the initial edit can be rejected.

This approach will generate new, valid matrixes quickly. It will not necessarily produce random outcomes: M1 and M2 will still be highly correlated with each other, a point that will become more directly evident as the size of the matrix grows.

How do you increase the randomness? You mentioned that most cells (99% or more) are zeros. One idea would be to proceed like this: for each 1 in the matrix, set its value to 0 and then repair the matrix using the 4-edit method outlined above. In effect, you would be moving all of the ones to new, random locations.

Here is an illustration. There are probably further speed optimizations in here, but this approach yielded 10 new 600x600 matrixes, at 0.5% density, in 30 seconds or so on my Windows box. Don't know if that's fast enough.

use strict;
use warnings;

# Args: N rows, N columns, density, N iterations.
main(@ARGV);

sub main {
    my $n_iter = pop;
    my $matrix = init_matrix(@_);
    print_matrix($matrix);
    for my $n (1 .. $n_iter){
        warn $n, "\n"; # Show progress.
        edit_matrix($matrix);
        print_matrix($matrix);
    }
}

sub init_matrix {
    # Generate initial matrix, given N of rows, N of cols, and density.
    my ($rows, $cols, $density) = @_;
    my @matrix;
    for my $r (1 .. $rows){
        push @matrix, [ map { rand() < $density ? 1 : 0  } 1 .. $cols ];
    }
    return \@matrix;
}

sub print_matrix {
    # Dump out a matrix for checking.
    my $matrix = shift;
    print "\n";
    for my $row (@$matrix){
        my @vals = map { $_ ? 1 : ''} @$row;
        print join("\t", @vals), "\n";
    }
}

sub edit_matrix {
    # Takes a matrix and moves all of the non-empty cells somewhere else.
    my $matrix = shift;
    my $move_these = cells_to_move($matrix);
    for my $cell (@$move_these){
        my ($i, $j) = @$cell;
        # Move the cell, provided that the cell hasn't been moved
        # already and the subsequent edits don't lead to a dead end.
        $matrix->[$i][$j] = 0
            if $matrix->[$i][$j]
            and other_edits($matrix, $cell, 0, $j);
    }
}

sub cells_to_move {
    # Returns a list of non-empty cells.
    my $matrix = shift;
    my $i = -1;
    my @cells = ();
    for my $row (@$matrix){
        $i ++;
        for my $j (0 .. @$row - 1){
            push @cells, [$i, $j] if $matrix->[$i][$j];
        }
    }
    return \@cells;
}

sub other_edits {
    my ($matrix, $cell, $step, $last_j) = @_;

    # We have succeeded if we've already made 3 edits.
    $step ++;
    return 1 if $step > 3;

    # Determine the roster of next edits to fix the row or
    # column total upset by our prior edit.
    my ($i, $j) = @$cell;
    my @fixes;
    if ($step == 1){
        @fixes = 
            map  { [$i, $_] }
            grep { $_ != $j and not $matrix->[$i][$_] }
            0 .. @{$matrix->[0]} - 1
        ;
        shuffle(\@fixes);
    }
    elsif ($step == 2) {
        @fixes = 
            map  { [$_, $j] }
            grep { $_ != $i and $matrix->[$_][$j] }
            0 .. @$matrix - 1
        ;
        shuffle(\@fixes);
    }
    else {
        # On the last edit, the column of the fix must be
        # the same as the column of the initial edit.
        @fixes = ([$i, $last_j]) unless $matrix->[$i][$last_j];
    }

    for my $f (@fixes){
        # If all subsequent fixes succeed, we are golden: make
        # the current fix and return true.
        if ( other_edits($matrix, [@$f], $step, $last_j) ){
            $matrix->[$f->[0]][$f->[1]] = $step == 2 ? 0 : 1;
            return 1;
        }
    }

    # Failure if we get here.
    return;
}

sub shuffle {
    my $array = shift;
    my $i = scalar(@$array);
    my $j;
    for (@$array ){
        $i --;
        $j = int rand($i + 1);
        @$array[$i, $j] = @$array[$j, $i] unless $i == $j;
    }
}
FM
The algorithm is definitely fast enough! I'm a bit worried about the non-randomness though, I really require matrices that are as random as possible, given the constraints. I will try to generate the matrices as per your algorithm, and check how correlated they are to the original matrix. Problem is, I cannot compare them with truly random matrices of course ;).
Lucas
@Lucas The algorithm, as implemented in the code, is quite random. Every 1 in the original matrix is turned off and 3 other cells are toggled to keep the row/column sums constant. Those 3 fixes are as random as they can be, subject to the constraints of the problem. If there is a non-random aspect, it comes from the fact that the 1's in the original matrix have a higher-than-random chance of being 0's in the new matrix. One way to correct for that bias might be to skip some proportion of 1's in the original matrix -- in other words, leave them as is. What proportion? Perhaps use the density.
FM
@Lucas: If you choose this method, a different way to improve the randomness is to repeat the randomization process several times (a random number of times). Then I think the correlation to the original matrix becomes negligible. It'll make it a little slower, but it's simpler to implement than my suggestion.
Mark Byers
Yes, I've ran the algorithm a couple of thousand times on various random matrices last night, and see that the correlation steadily decreases with each iteration.Nice solution FM!Could you maybe explain how you use the question mark map function in your init matrix? You're generating an array of arrays as the matrix, but I'm unfamiliar with the notation. Thanks a billion for providing a nice solution to my problem :).
Lucas
@Lucas Glad to be of help. It was an interesting problem, one somewhat related to a side project I've been working on lately. Regarding `map { rand() < $density ? 1 : 0 } 1 .. $cols` we are generating a list of 0's and 1's. If `rand()` is less than `$density`, the cell will be a 1; otherwise a 0. Search `perldoc perlop` for the conditional operator for more details on the `? :` syntax. It's basically a mini IF-THEN construct that can be used conveniently within larger expressions.
FM