views:

320

answers:

6

I am beginning to delve deeper into Perl, but am having trouble writing "Perl-ly" code instead of writing C in Perl. How can I change the following code to use more Perl idioms, and how should I go about learning the idioms?

Just an explanation of what it is doing: This routine is part of a module that aligns DNA or amino acid sequences(using Needelman-Wunch if you care about such things). It creates two 2d arrays, one to store a score for each position in the two sequences, and one to keep track of the path so the highest-scoring alignment can be recreated later. It works fine, but I know I am not doing things very concisely and clearly.

edit: This was for an assignment. I completed it, but want to clean up my code a bit. The details on implementing the algorithm can be found on the class website if any of you are interested.

sub create_matrix {
    my $self = shift;
    #empty array reference
    my $matrix = $self->{score_matrix};
    #empty array ref
    my $path_matrix = $self->{path_matrix};
    #$seq1 and $seq2 are strings set previously
    my $num_of_rows = length($self->{seq1}) + 1;
    my $num_of_columns = length($self->{seq2}) + 1;

    #create the 2d array of scores
    for (my $i = 0; $i < $num_of_rows; $i++) {
     push(@$matrix, []);
     push(@$path_matrix, []);
     $$matrix[$i][0] = $i * $self->{gap_cost};
     $$path_matrix[$i][0] = 1;
    }

    #fill out the first row
    for (my $i = 0; $i < $num_of_columns; $i++) {
     $$matrix[0][$i] = $i * $self->{gap_cost};
     $$path_matrix[0][$i] = -1;
    }
    #flag to signal end of traceback
    $$path_matrix[0][0] = 2;
    #double for loop to fill out each row
    for (my $row = 1; $row < $num_of_rows; $row++) {
     for (my $column = 1; $column < $num_of_columns; $column++) {
      my $seq1_gap = $$matrix[$row-1][$column] + $self->{gap_cost};
      my $seq2_gap = $$matrix[$row][$column-1] + $self->{gap_cost};
      my $match_mismatch = $$matrix[$row-1][$column-1] + $self->get_match_score(substr($self->{seq1}, $row-1, 1), substr($self->{seq2}, $column-1, 1));
      $$matrix[$row][$column] = max($seq1_gap, $seq2_gap, $match_mismatch);

      #set the path matrix
      #if it was a gap in seq1, -1, if was a (mis)match 0 if was a gap in seq2 1
      if ($$matrix[$row][$column] == $seq1_gap) {
       $$path_matrix[$row][$column] = -1;
      }
      elsif ($$matrix[$row][$column] == $match_mismatch) {
       $$path_matrix[$row][$column] = 0;
      }
      elsif ($$matrix[$row][$column] == $seq2_gap) {
       $$path_matrix[$row][$column] = 1;
      }
     }
    }
}
+8  A: 

One simple change is to use for loops like this:

for my $i (0 .. $num_of_rows){
    # Do stuff.
}

For more info, see the Perl documentation on foreach loops and the range operator.

FM
I have never understood this syntax too well. What exactly does it do?
Jergason
@Jergason it sets `$i` to each of the values from `0` to `$num_of_rows`, inclusive.
Sinan Ünür
@Jergason: `..` is the range operator. It returns a list of values (incrementing by ones) from the first value to the second. In this case, it's the list `0, 1, 2...` up to the value of `$num_rows`.
Michael Carman
It might be easier if you read it as `foreach`. I generally only use `for` for the C-style 3-argument for-loops, and `foreach` for the "loop over a list" form. (see perldoc perlsyn)
Ether
+7  A: 

I have some other comments as well, but here is the first observation:

my $num_of_rows = length($self->{seq1}) + 1;
my $num_of_columns = length($self->{seq2}) + 1;

So $self->{seq1} and $self->{seq2} are strings and you keep accessing individual elements using substr. I would prefer to store them as arrays of characters:

$self->{seq1} = [ split //, $seq1 ];

Here is how I would have written it:

sub create_matrix {
    my $self = shift;

    my $matrix      = $self->{score_matrix};
    my $path_matrix = $self->{path_matrix};

    my $rows = @{ $self->{seq1} };
    my $cols = @{ $self->{seq2} };

    for my $row (0 .. $rows) {
        $matrix->[$row]->[0] =  $row * $self->{gap_cost};
        $path_matrix->[$row]->[0] = 1;
    }

    my $gap_cost = $self->{gap_cost};

    $matrix->[0] = [ map { $_ * $gap_cost } 0 .. $cols ];
    $path_matrix->[0] = [ (-1) x ($cols + 1) ];

    $path_matrix->[0]->[0] = 2;

    for my $row (1 .. $rows) {
        for my $col (1 .. $cols) {
            my $gap1 = $matrix->[$row - 1]->[$col] + $gap_cost;
            my $gap2 = $matrix->[$row]->[$col - 1] + $gap_cost;
            my $match_mismatch =
                $matrix->[$row - 1]->[$col - 1] +
                $self->get_match_score(
                    $self->{seq1}->[$row - 1],
                    $self->{seq2}->[$col - 1]
                );

            my $max = $matrix->[$row]->[$col] =
                max($gap1, $gap2, $match_mismatch);

            $path_matrix->[$row]->[$col] = $max == $gap1
                    ? -1
                    : $max == $gap2
                    ? 1
                    : 0;
            }
        }
    }
Sinan Ünür
This is because the algorithm requires a 2d array one row and column bigger than the lengths of the two sequences.
Jergason
I liked the idea of splitting into arrays of characters. The indicies are horribly unclear, aren't they? If seq1 has i characters and seq2 has j characters, then the matrix needs to have i + 1 rows and j + 1 columns.The score at each location in the matrix is the maximum of the scores coming from the location directly above + a gap cost, the upper-left neighbor + a score for a match or mismatch at the current location and the location directly to the right + a gap cost.
Jergason
Looping over `0 .. $num_of_rows`, you do not need to add 1. While I am at it, I am going to recommend changing the variable names to `$rows` and `$cols`, respectively.
Sinan Ünür
Why all the extra dereferencing arrows? `$matrix->[$row]->[0]` is equivalent to `$matrix->[$row][0]`.
daotoad
@daotoad: Well, that is a style choice.
Sinan Ünür
+5  A: 

The majority of your code is manipulating 2D arrays. I think the biggest improvement would be switching to using PDL if you want to do much stuff with arrays, particularly if efficiency is a concern. It's a Perl module which provides excellent array support. The underlying routines are implemented in C for efficiency so it's fast too.

ire_and_curses
+7  A: 

Instead of dereferencing your two-dimensional arrays like this:

$$path_matrix[0][0] = 2;

do this:

$path_matrix->[0][0] = 2;

Also, you're doing a lot of if/then/else statements to match against particular subsequences: this could be better written as given statements (perl5.10's equivalent of C's switch). Read about it at perldoc perlsyn:

given ($matrix->[$row][$column])
{
    when ($seq1_gap)       { $path_matrix->[$row][$column] = -1; }
    when ($match_mismatch) { $path_matrix->[$row][$column] = 0; }
    when ($seq2_gap)       { $path_matrix->[$row][$column] = 1; }
}
Ether
I had not heard of given before. That is neat.
Jergason
The `$_ ==` part isn't required in the `when` blocks, is it?
Rob Kennedy
PS. My life has never been the same since I incorporated "truthiness" into my vocabulary. Thank you Stephen Colbert!
Ether
Ether: argument of given is aliased by $_. $_ is a left argument of implicit smart match in when. See 'when ("foo")' in perlsyn, next example.
Alexandr Ciornii
Thanks! I'm pretty new to 5.10 and haven't used the smart match operator yet.
Ether
Incorrect comment deleted and code corrected; thanks again Alexandr.
Ether
+9  A: 

You're getting several suggestions regarding syntax, but I would also suggest a more modular approach, if for no other reason that code readability. It's much easier to come up to speed on code if you can perceive the big picture before worrying about low-level details.

Your primary method might look like this.

sub create_matrix {
    my $self = shift;
    $self->create_2d_array_of_scores;
    $self->fill_out_first_row;
    $self->fill_out_other_rows;
}

And you would also have several smaller methods like this:

n_of_rows
n_of_cols
create_2d_array_of_scores
fill_out_first_row
fill_out_other_rows

And you might take it even further by defining even smaller methods -- getters, setters, and so forth. At that point, your middle-level methods like create_2d_array_of_scores would not directly touch the underlying data structure at all.

sub matrix      { shift->{score_matrix} }
sub gap_cost    { shift->{gap_cost}     }

sub set_matrix_value {
    my ($self, $r, $c, $val) = @_;
    $self->matrix->[$r][$c] = $val;
}

# Etc.
FM
+1 for promoting self-documenting code: the new idiom
Ewan Todd
A: 

I would always advise to look at CPAN for previous solutions or examples of how to do things in Perl. Have you looked at Algorithm::NeedlemanWunsch?

The documentation to this module includes an example for matching DNA sequences. Here is an example using the similarity matrix from wikipedia.

#!/usr/bin/perl -w
use strict;
use warnings;
use Inline::Files;                 #multiple virtual files inside code
use Algorithm::NeedlemanWunsch;    # refer CPAN - good style guide

# Read DNA sequences
my @a = read_DNA_seq("DNA_SEQ_A");
my @b = read_DNA_seq("DNA_SEQ_B");

# Read Similarity Matrix (held as a Hash of Hashes)
my %SM = read_Sim_Matrix();

# Define scoring based on "Similarity Matrix" %SM
sub score_sub {
    if ( !@_ ) {
        return -3;                 # gap penalty same as wikipedia)
    }
    return $SM{ $_[0] }{ $_[1] };    # Similarity Value matrix
}

my $matcher = Algorithm::NeedlemanWunsch->new( \&score_sub, -3 );
my $score = $matcher->align( \@a, \@b, { align => \&check_align, } );

print "\nThe maximum score is $score\n";

sub check_align {
    my ( $i, $j ) = @_;              # @a[i], @b[j]
    print "seqA pos: $i, seqB pos: $j\t base \'$a[$i]\'\n";
}

sub read_DNA_seq {
    my $source = shift;
    my @data;
    while (<$source>) {
        push @data, /[ACGT-]{1}/g;
    }
    return @data;
}

sub read_Sim_Matrix {

    #Read DNA similarity matrix (scores per Wikipedia)
    my ( @AoA, %HoH );
    while (<SIMILARITY_MATRIX>) {
        push @AoA, [/(\S+)+/g];
    }

    for ( my $row = 1 ; $row < 5 ; $row++ ) {
        for ( my $col = 1 ; $col < 5 ; $col++ ) {
            $HoH{ $AoA[0][$col] }{ $AoA[$row][0] } = $AoA[$row][$col];
        }
    }
    return %HoH;
}

__DNA_SEQ_A__
A T G T A G T G T A T A G T
A C A T G C A
__DNA_SEQ_B__
A T G T A G T A C A T G C A
__SIMILARITY_MATRIX__
-  A  G  C  T
A  10  -1  -3  -4
G  -1  7  -5  -3
C  -3  -5  9  0
T  -4  -3  0  8

And here is some sample output:

seqA pos: 7, seqB pos: 2  base 'G'
seqA pos: 6, seqB pos: 1  base 'T'
seqA pos: 4, seqB pos: 0  base 'A'

The maximum score is 100
heferav
I have looked at it, but it was for an assignment that required us to implement the algorithm ourselves.
Jergason