ansaurus

Question

How do I change this to "idiomatic" Perl?

Answer 1

+8 A:

One simple change is to use for loops like this:

for my $i (0 .. $num_of_rows){
    # Do stuff.
}

For more info, see the Perl documentation on foreach loops and the range operator.
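As a minimal sketch of what the loop above iterates over, assuming a hypothetical `$num_of_rows` of 3 (the value is not from the original code), the range includes both endpoints:

```perl
use strict;
use warnings;

# Hypothetical value, chosen only for illustration.
my $num_of_rows = 3;

# The range 0 .. $num_of_rows includes both endpoints.
my @visited;
for my $i (0 .. $num_of_rows) {
    push @visited, $i;
}

print "@visited\n";    # prints "0 1 2 3"
```

So a loop written this way runs `$num_of_rows + 1` times.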

FM 2009-10-23 16:32:27

I have never understood this syntax too well. What exactly does it do?

Jergason 2009-10-23 16:52:24

@Jergason it sets `$i` to each of the values from `0` to `$num_of_rows`, inclusive.

Sinan Ünür 2009-10-23 17:00:14

@Jergason: `..` is the range operator. It returns a list of values (incrementing by one) from the first value to the second. In this case, it's the list `0, 1, 2...` up to the value of `$num_of_rows`.

Michael Carman 2009-10-23 17:03:28

It might be easier if you read it as `foreach`. I generally only use `for` for the C-style 3-argument for-loops, and `foreach` for the "loop over a list" form. (see perldoc perlsyn)

Ether 2009-10-23 17:37:42
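A short sketch of the two loop forms mentioned in the comment above, using a hypothetical upper bound of 4. Both visit the same indices; the keywords `for` and `foreach` are interchangeable, so the choice is a readability convention.

```perl
use strict;
use warnings;

my @list_style;
my @c_style;

# List form: "foreach" and "for" are interchangeable keywords here.
foreach my $i (0 .. 4) {
    push @list_style, $i;
}

# C-style three-part form, visiting the same indices.
for (my $i = 0; $i <= 4; $i++) {
    push @c_style, $i;
}

print "same\n" if "@list_style" eq "@c_style";
```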

Answer 2

+7 A:

I have some other comments as well, but here is the first observation:

my $num_of_rows = length($self->{seq1}) + 1;
my $num_of_columns = length($self->{seq2}) + 1;

So $self->{seq1} and $self->{seq2} are strings and you keep accessing individual elements using substr. I would prefer to store them as arrays of characters:

$self->{seq1} = [ split //, $seq1 ];
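To illustrate the difference, here is a small sketch comparing the two storage choices. The sequence `GATTACA` and the variable names are assumptions made up for the example.

```perl
use strict;
use warnings;

my $seq1 = 'GATTACA';    # hypothetical sequence

# String storage: each lookup goes through substr.
my $third_from_string = substr($seq1, 2, 1);

# Array storage: splitting on the empty pattern yields one element per character.
my $chars = [ split //, $seq1 ];
my $third_from_array = $chars->[2];

print "$third_from_string $third_from_array\n";    # prints "T T"
print scalar(@$chars), "\n";                        # prints 7
```

With the array form, the number of characters is simply the array's size in scalar context, which is what the rewrite below relies on.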

Here is how I would have written it:

use List::Util qw(max);    # provides max(), used below

sub create_matrix {
    my $self = shift;

    my $matrix      = $self->{score_matrix};
    my $path_matrix = $self->{path_matrix};
    my $gap_cost    = $self->{gap_cost};

    # seq1 and seq2 are array refs of characters
    my $rows = @{ $self->{seq1} };
    my $cols = @{ $self->{seq2} };

    # First column: accumulated gap costs
    for my $row (0 .. $rows) {
        $matrix->[$row]->[0]      = $row * $gap_cost;
        $path_matrix->[$row]->[0] = 1;
    }

    # First row: accumulated gap costs
    $matrix->[0]      = [ map { $_ * $gap_cost } 0 .. $cols ];
    $path_matrix->[0] = [ (-1) x ($cols + 1) ];

    $path_matrix->[0]->[0] = 2;

    for my $row (1 .. $rows) {
        for my $col (1 .. $cols) {
            my $gap1 = $matrix->[$row - 1]->[$col] + $gap_cost;
            my $gap2 = $matrix->[$row]->[$col - 1] + $gap_cost;
            my $match_mismatch =
                $matrix->[$row - 1]->[$col - 1] +
                $self->get_match_score(
                    $self->{seq1}->[$row - 1],
                    $self->{seq2}->[$col - 1]
                );

            my $max = $matrix->[$row]->[$col] =
                max($gap1, $gap2, $match_mismatch);

            $path_matrix->[$row]->[$col] = $max == $gap1
                    ? -1
                    : $max == $gap2
                    ? 1
                    : 0;
        }
    }
}

Sinan Ünür 2009-10-23 16:33:11

This is because the algorithm requires a 2d array one row and column bigger than the lengths of the two sequences.

Jergason 2009-10-23 16:34:17

I liked the idea of splitting into arrays of characters. The indices are horribly unclear, aren't they? If seq1 has i characters and seq2 has j characters, then the matrix needs to have i + 1 rows and j + 1 columns. The score at each location in the matrix is the maximum of three candidates: the score from the location directly above + a gap cost, the score from the upper-left neighbor + a score for a match or mismatch at the current location, and the score from the location directly to the left + a gap cost.

Jergason 2009-10-23 16:48:12
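The recurrence described in the comment above can be sketched as a standalone function. This is an illustration, not the original poster's code: the function name `score_matrix`, the sequences `GA` and `GCA`, and the scoring values (match 1, mismatch -1, gap -1) are all assumptions.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical scoring values, not taken from the original code.
my ($match, $mismatch, $gap) = (1, -1, -1);

sub score_matrix {
    my ($seq1, $seq2) = @_;
    my @s1   = split //, $seq1;
    my @s2   = split //, $seq2;
    my $rows = @s1;
    my $cols = @s2;

    my @m;
    # First column and first row hold accumulated gap costs.
    $m[$_][0] = $_ * $gap for 0 .. $rows;
    $m[0][$_] = $_ * $gap for 0 .. $cols;

    for my $row (1 .. $rows) {
        for my $col (1 .. $cols) {
            my $diag = $s1[$row - 1] eq $s2[$col - 1] ? $match : $mismatch;
            $m[$row][$col] = max(
                $m[$row - 1][$col] + $gap,         # from above
                $m[$row][$col - 1] + $gap,         # from the left
                $m[$row - 1][$col - 1] + $diag,    # from the upper left
            );
        }
    }
    return \@m;
}

my $m = score_matrix('GA', 'GCA');
print join(' ', @$_), "\n" for @$m;
```

For these inputs the matrix has 3 rows and 4 columns, one more than each sequence length, and the bottom-right cell holds the alignment score.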

Looping over `0 .. $num_of_rows`, you do not need to add 1. While I am at it, I am going to recommend changing the variable names to `$rows` and `$cols`, respectively.

Sinan Ünür 2009-10-23 16:53:06

Why all the extra dereferencing arrows? `$matrix->[$row]->[0]` is equivalent to `$matrix->[$row][0]`.

daotoad 2009-10-23 22:18:24

@daotoad: Well, that is a style choice.

Sinan Ünür 2009-10-23 23:07:54

Answer 3

+5 A:

The majority of your code manipulates 2D arrays. I think the biggest improvement would be switching to PDL (the Perl Data Language) if you do a lot of array work, particularly if efficiency is a concern. It is a Perl module that provides excellent array support, and its underlying routines are implemented in C, so it is fast too.

ire_and_curses 2009-10-23 16:34:16

Answer 4

+7 A:

Instead of dereferencing your two-dimensional arrays like this:

$$path_matrix[0][0] = 2;

do this:

$path_matrix->[0][0] = 2;
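A small sketch showing that these spellings address the same structure; the 2x2 matrix and the assigned values are made up for the example.

```perl
use strict;
use warnings;

my $path_matrix = [ [ 0, 0 ], [ 0, 0 ] ];

$$path_matrix[0][0]    = 2;    # explicit dereference with a leading sigil
$path_matrix->[0][1]   = 3;    # arrow on the reference, implicit between subscripts
$path_matrix->[1]->[0] = 4;    # every arrow written out

print join(',', map { @$_ } @$path_matrix), "\n";    # prints "2,3,4,0"
```

The arrow between adjacent subscripts is optional, so the second and third forms differ only in style.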

Also, you're using a lot of if/elsif/else statements to match against particular subsequences. These could be better written as given statements (Perl 5.10's equivalent of C's switch, enabled with use feature 'switch'). Read about it at perldoc perlsyn:

given ($matrix->[$row][$column])
{
    when ($seq1_gap)       { $path_matrix->[$row][$column] = -1; }
    when ($match_mismatch) { $path_matrix->[$row][$column] = 0; }
    when ($seq2_gap)       { $path_matrix->[$row][$column] = 1; }
}
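Because given/when relies on smart matching, which later Perl releases flagged as experimental, the same selection can also be sketched with a plain if/elsif chain and explicit numeric comparison. The values below are hypothetical stand-ins for what the scoring loop would compute.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the values computed inside the scoring loop.
my ($seq1_gap, $match_mismatch, $seq2_gap) = (-2, 1, -3);
my $score = 1;    # plays the role of $matrix->[$row][$column]

my $direction;
if    ($score == $seq1_gap)       { $direction = -1 }
elsif ($score == $match_mismatch) { $direction = 0 }
elsif ($score == $seq2_gap)       { $direction = 1 }

print "$direction\n";    # prints 0
```

As in the given block, the first matching branch wins when two candidates tie.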

Ether 2009-10-23 16:36:25

I had not heard of given before. That is neat.

Jergason 2009-10-23 18:30:02

The `$_ ==` part isn't required in the `when` blocks, is it?

Rob Kennedy 2009-10-23 20:09:16

PS. My life has never been the same since I incorporated "truthiness" into my vocabulary. Thank you Stephen Colbert!

Ether 2009-10-23 20:43:16

Ether: the argument of `given` is aliased to `$_`, and `$_` is the left operand of the implicit smart match in `when`. See 'when ("foo")' in perlsyn, next example.

Alexandr Ciornii 2009-10-24 10:25:42

Thanks! I'm pretty new to 5.10 and haven't used the smart match operator yet.

Ether 2009-10-24 18:44:49

Incorrect comment deleted and code corrected; thanks again Alexandr.

Ether 2009-10-24 18:49:22

Answer 5

+9 A:

You're getting several suggestions regarding syntax, but I would also suggest a more modular approach, if for no other reason than code readability. It's much easier to come up to speed on code if you can perceive the big picture before worrying about low-level details.

Your primary method might look like this.

sub create_matrix {
    my $self = shift;
    $self->create_2d_array_of_scores;
    $self->fill_out_first_row;
    $self->fill_out_other_rows;
}

And you would also have several smaller methods like this:

n_of_rows
n_of_cols
create_2d_array_of_scores
fill_out_first_row
fill_out_other_rows

And you might take it even further by defining even smaller methods -- getters, setters, and so forth. At that point, your middle-level methods like create_2d_array_of_scores would not directly touch the underlying data structure at all.

sub matrix      { shift->{score_matrix} }
sub gap_cost    { shift->{gap_cost}     }

sub set_matrix_value {
    my ($self, $r, $c, $val) = @_;
    $self->matrix->[$r][$c] = $val;
}

# Etc.
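To show how the pieces might fit together, here is a self-contained sketch. The package name `Aligner`, the constructor, and the body of `fill_out_first_row` are assumptions invented for illustration; only the accessor style comes from the snippets above.

```perl
use strict;
use warnings;

package Aligner;    # hypothetical class name

sub new {
    my ($class, %args) = @_;
    return bless { score_matrix => [], %args }, $class;
}

sub matrix    { shift->{score_matrix} }
sub gap_cost  { shift->{gap_cost} }
sub n_of_cols { length(shift->{seq2}) + 1 }

sub set_matrix_value {
    my ($self, $r, $c, $val) = @_;
    $self->matrix->[$r][$c] = $val;
}

# Built only from the methods above; never touches the hash directly.
sub fill_out_first_row {
    my $self = shift;
    $self->set_matrix_value(0, $_, $_ * $self->gap_cost)
        for 0 .. $self->n_of_cols() - 1;
}

package main;

my $aligner = Aligner->new(seq1 => 'GA', seq2 => 'GCA', gap_cost => -1);
$aligner->fill_out_first_row;
print "@{ $aligner->matrix->[0] }\n";    # prints "0 -1 -2 -3"
```

If the sequences were later stored as arrays instead of strings, only `n_of_cols` would need to change.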

FM 2009-10-23 17:27:00

+1 for promoting self-documenting code: the new idiom

Ewan Todd 2009-10-23 18:13:19

Answer 6

A:

I would always advise looking at CPAN for previous solutions or examples of how to do things in Perl. Have you looked at Algorithm::NeedlemanWunsch?

The documentation for this module includes an example for matching DNA sequences. Here is an example using the similarity matrix from Wikipedia.

#!/usr/bin/perl -w
use strict;
use warnings;
use Inline::Files;                 #multiple virtual files inside code
use Algorithm::NeedlemanWunsch;    # refer CPAN - good style guide

# Read DNA sequences
my @a = read_DNA_seq("DNA_SEQ_A");
my @b = read_DNA_seq("DNA_SEQ_B");

# Read Similarity Matrix (held as a Hash of Hashes)
my %SM = read_Sim_Matrix();

# Define scoring based on "Similarity Matrix" %SM
sub score_sub {
    if ( !@_ ) {
        return -3;                 # gap penalty (same as Wikipedia)
    }
    return $SM{ $_[0] }{ $_[1] };    # Similarity Value matrix
}

my $matcher = Algorithm::NeedlemanWunsch->new( \&score_sub, -3 );
my $score = $matcher->align( \@a, \@b, { align => \&check_align, } );

print "\nThe maximum score is $score\n";

sub check_align {
    my ( $i, $j ) = @_;              # @a[i], @b[j]
    print "seqA pos: $i, seqB pos: $j\t base \'$a[$i]\'\n";
}

sub read_DNA_seq {
    my $source = shift;
    my @data;
    while (<$source>) {
        push @data, /[ACGT-]{1}/g;
    }
    return @data;
}

sub read_Sim_Matrix {

    #Read DNA similarity matrix (scores per Wikipedia)
    my ( @AoA, %HoH );
    while (<SIMILARITY_MATRIX>) {
        push @AoA, [/(\S+)+/g];
    }

    for ( my $row = 1 ; $row < 5 ; $row++ ) {
        for ( my $col = 1 ; $col < 5 ; $col++ ) {
            $HoH{ $AoA[0][$col] }{ $AoA[$row][0] } = $AoA[$row][$col];
        }
    }
    return %HoH;
}

__DNA_SEQ_A__
A T G T A G T G T A T A G T
A C A T G C A
__DNA_SEQ_B__
A T G T A G T A C A T G C A
__SIMILARITY_MATRIX__
-  A  G  C  T
A  10  -1  -3  -4
G  -1  7  -5  -3
C  -3  -5  9  0
T  -4  -3  0  8

And here is some sample output:

seqA pos: 7, seqB pos: 2  base 'G'
seqA pos: 6, seqB pos: 1  base 'T'
seqA pos: 4, seqB pos: 0  base 'A'

The maximum score is 100

heferav 2009-10-25 11:32:24

I have looked at it, but it was for an assignment that required us to implement the algorithm ourselves.

Jergason 2009-10-25 21:25:49
