views:

820

answers:

3

I need a bit of help with is this code. I know the sections that should be recursive, or at least I think I do but am not sure how to implement it. I am trying to implement a path finding program from an alignment matrix that will find multiple routes back to the zero value. For example if you excute my code and insert CGCA as the first sequence and CACGTAT as the second sequence, and 1, 0, and -1 as the match, mismatch and gap scores. The program gives off the path as HDHHDD and the aligment as

CACGTAT

CGC--A-.

However there are more possible paths and aligments then this, except I don't know how many. What I want to do is have a piece of my code loop back on itself and find other paths and alignments, using the same code as the first time around, until it runs out of possible alignments. The best way I found on the net to do this is recursion, except no one can explain how to do it. In this case there should be two more paths and aligmennts HDDDHHD and CACGTAT, and C--GCA-, and. HDDDDHH , CACGTAT AND --CGCA-. I just don't know how to code to perform this task.

# Implementation of Needleman and Wunsch Algorithm

my($seq1, $len1, $seq2, $len2, $data, @matrix, $i, $j, $x, $y, $val1, $val2);
my($val3, $pathrow, $pathcol, $seq1loc, $seq2loc, $gapscore, $matchscore, $mismatchscore);

#first obtain the data from the user. 
print "Please enter the first sequence for comaprsion\n";
$seq1=<STDIN>;
chomp $seq1;

print "Please enter the second sequence for comparsion\n";
$seq2=<STDIN>;
chomp $seq2;


# adding extra characters so sequences align with matrix
# saves some calculations later on
$seq1 = " " . $seq1;
$seq2 = " " . $seq2;
$len1 = length($seq1);
$len2 = length($seq2);

print "Enter the match score: ";
$matchscore=<STDIN>;
chomp $matchscore;

print "Enter the mismatch score: ";
$mismatchscore=<STDIN>;
chomp $mismatchscore;

print "Enter the gap score: ";
$gapscore=<STDIN>;
chomp $gapscore;

# declare a two dimensional array and initialize to spaces
# array must contain one extra row and one extra column
@matrix = ();  
for($i = 0; $i < $len1; $i++){  
   for($j = 0; $j < $len2; $j++){  
      $matrix[$i][$j] = ' ';  
   }  
}  

# initialize 1st row and 1st column of matrix  
$matrix[0][0] = 0;  
for ($i = 1; $i < $len1; $i ++){  
    $matrix[$i][0] = $matrix[$i-1][0] + $gapscore;  
}  
for ($i = 1; $i < $len2; $i ++){  
    $matrix[0][$i] = $matrix[0][$i-1] + $gapscore;  
}  


# STEP 1:
# Fill in rest of matrix using the following rules:
# determine three possible values for matrix[x][y]
# value 1 = add gap score to matrix[x][y-1]
# value 2 = add gap score to matrix[x-1][y]
# value 3 = add match score or mismatch score to 
#           matrix[x-1][y-1] depending on nucleotide 
#           match for position x of $seq1 and position y
#           of seq2
# place the largest of the three values in matrix[x][y]
#
# Best alignment score appears in matrix[$len1][$len2].

for($x = 1; $x < $len1; $x++){  
   for($y = 1; $y < $len2; $y++){  
 $val1 = $matrix[$x][$y-1] + $gapscore;  
 $val2 = $matrix[$x-1][$y] + $gapscore;  
 if (substr($seq1, $x, 1) eq substr($seq2, $y, 1)){  
           $val3 = $matrix[$x-1][$y-1] + $matchscore;  
 }  
 else{  
    $val3 = $matrix[$x-1][$y-1] + $mismatchscore;  
 }  
 if (($val1 >= $val2) && ($val1 >= $val3)){  
    $matrix[$x][$y] = $val1;  
 }  
 elsif (($val2 >= $val1) && ($val2 >= $val3)){  
    $matrix[$x][$y] = $val2;  
 }  
 else{  
    $matrix[$x][$y] = $val3;  
 }  
   }  
}  

# Display scoring matrix  
print "MATRIX:\n";   
for($x = 0; $x < $len1; $x++){  
   for($y = 0; $y < $len2; $y++){  
 print "$matrix[$x][$y] ";  
   }  
   print "\n";  
}  
print "\n";    

# STEP 2:  
# Begin at matrix[$len1][$len2] and find a path to   
# matrix[0][0].  
# Build string to hold path pattern by concatenating either   
# 'H' (current cell produced by cell horizontally to left),   
# 'D' (current cell produced by cell on diagonal),   
# 'V' (current cell produced by cell vertically above)
# ***This is were I need help I need this code to be recursive, so I can find more then one path***

$pathrow = $len1-1;  
$pathcol = $len2-1;  

while (($pathrow != 0) || ($pathcol != 0)){  
    if ($pathrow == 0){  
       # must be from cell to left  
       $path = $path . 'H';  
       $pathcol = $pathcol - 1;  
    }  
    elsif ($pathcol == 0){  
       # must be from cell above  
       $path = $path . 'V';  
       $pathrow = $pathrow - 1;  
    }  
    # could be from any direction  
    elsif  (($matrix[$pathrow][$pathcol-1] + $gapscore) == $matrix[$pathrow][$pathcol]){  
       # from left  
       $path = $path . 'H';  
       $pathcol = $pathcol - 1;  
    }  
    elsif (($matrix[$pathrow-1][$pathcol] + $gapscore) == $matrix[$pathrow][$pathcol]){  
       # from above  
       $path = $path . 'V';  
       $pathrow = $pathrow - 1;  
    }   
    else{  
       # must be from diagonal  
       $path = $path . 'D';  
       $pathrow = $pathrow - 1;  
       $pathcol = $pathcol - 1;  
    }  


}   



print "Path is $path\n";   

# STEP 3:
# Determine alignment pattern by reading path string 
# created in step 2.
# Create two string variables ($alignseq1 and $alignseq2) to hold 
# the sequences for alignment.
# Reading backwards from $seq1, $seq2 and path string, 
#   if string character is 'D', then 
#      concatenate to front of $alignseq1, last char in $seq1 
#      and to the front of $alignseq2, last char in $seq2
#   if string character is 'V', then 
#      concatenate to front of $alignseq1, last char in $seq1 
#      and to the front of $alignseq2, the gap char
#   if string character is 'H', then 
#      concatenate to front of $alignseq1 the gap char
#      and to the front of $alignseq2, last char in $seq2
# Continue process until path string has been traversed.
# Result appears in string $alignseq1 and $seq2
***#I need this code to be recursive as well.***

$seq1loc = $len1-1;  
$seq2loc = $len2-1;  
$pathloc = 0;  
print length($path);  
while ($pathloc < length($path)){  
   if (substr($path, $pathloc, 1) eq 'D'){  
      $alignseq1 = substr($seq1, $seq1loc, 1) . $alignseq1;  
      $alignseq2 = substr($seq2, $seq2loc, 1) . $alignseq2;  
      $seq1loc--;  
      $seq2loc--;  
   }  
   elsif (substr($path, $pathloc, 1) eq 'V'){  
      $alignseq1 = substr($seq1, $seq1loc, 1) . $alignseq1;  
      $alignseq2 = '-' . $alignseq2;  
      $seq1loc--;  
   }    
   else{  # must be an H  
      $alignseq1 = '-' . $alignseq1;  
      $alignseq2 = substr($seq2, $seq2loc, 1) . $alignseq2;  
      $seq2loc--;  
   }    
   $pathloc++;  
}  

print "\nAligned Sequences:\n";  
print "$alignseq2 \n";  
print "$alignseq1 \n";  

# statement may be needed to hold output screen  
print "Press any key to exit program";  
$x = <STDIN>;

If anyone is wondering this is a needleman-wunsch algorithm. Any help here would be greatly apperciated.

+7  A: 

I can't provide an answer, because I don't understand exactly what you are try to do, but I can offer some general advice.

Start organizing your code into discrete subroutines that perform narrowly defined tasks. In addition, the subroutines that implement your central algorithms should not be oriented toward receiving input from the keyboard and producing output to the screen; rather they should receive input as arguments and return their results. If there is a need for user input or screen output, those tasks should be in separate subroutines, not comingled with your primary algorithms.

A first (and partial) step down that path is to take you entire program, enclose it in a subroutine definition, and then call the subroutine with the required arguments. Instead of printing its key results, the subroutine should return them -- specifically, a reference to @matrix along with the values for $path, $alignseq1, $alignseq2.

sub NW_algo {
    my ($seq1, $seq2, $matchscore, $mismatchscore, $gapscore) = @_;

    # The rest of your code here, but with all print
    # statements and <STDIN> inputs commented out.

    return \@matrix, $path, $alignseq1, $alignseq2;
}

my(@return_values) = NW_algo('CGCA', 'CACGTAT', 1, 0, -1);

Print_matrix($return_values[0]);

sub Print_matrix {
    for my $m ( @{$_[0]} ){
        print join(' ', @$m), "\n";
    }
}

At this point, you'll have an algorithm that can be invoked by other code, making it easier to test and debug your program going forward. For example, you could define various sets of input data and run NW_algo() on each set. Only then will it be possible to think about recursion or other techniques.

FM
+5  A: 

Since Needleman-Wunsch is a dynamic-programming algorithm, most of the work is already done by the time you compute your DP matrix. Once you have your DP matrix, you're supposed to backtrack through the matrix to find the optimal alignment. The problem is a bit like taxicab geometry except that you can move diagonally. Essentially, when you need to backtrack through the matrix, instead of choosing between going up, left, or diagonally, you do all three by making three recursive calls, and each of those call themselves for each of up, left, or diagonally, until you reach your starting point. The path traced by each strand of recursion will draw out each alignment.

EDIT: So basically you need to put Step 2 in a subprocedure (that takes position and the path traced so far), so it can call itself over and over. Of course, after you define the procedure you need to make one call to it to actually start the tracing process:

sub tracePaths {
    $x = shift;
    $y = shift;
    $pathSoFar = shift; # assuming you're storing your path as a string
    #
    # ... do some work ...
    #
    tracePaths($x - 1, $y, $pathSoFar . something);
    tracePaths($x, $y - 1, $pathSoFar . somethingelse);
    tracePaths($x - 1, $y - 1, $pathSoFar . somethingelselse);
    #
    #
    #
    if(reached the end) return $pathSoFar;
}
# launch the recursion
tracePaths(beginningx, beginningy, "");
erjiang
+4  A: 

This doesn't speak specifically to your problem, but you should maybe check out the book Higher Order Perl. It goes over how to use a lot of higher-level techniques (such as recursion).

Chris Simmons