The following script finds a single motif in a protein sequence.

use strict;
use warnings;

my @file_data=();
my $protein_seq='';
my $h= '[VLIM]';   
my $s= '[AG]';
my $x= '[ARNDCEQGHILKMFPSTWYV]';
my $regexp = "($h){4}D($x){4}D"; #motif to be searched is hhhhDxxxxD
my @locations=();

@file_data= get_file_data("seq.txt");

$protein_seq= extract_sequence(@file_data); 

# search for the motif hhhhDxxxxD in each line of the given file

foreach my $line (@file_data) {
    if ($line =~ /$regexp/) {
        print "found motif \n\n";
    } else {
        print "not found \n\n";
    }
}
# record the positions of the motif so they can be printed (offsets are 0-based)

@locations = match_positions($regexp, $protein_seq);
if (@locations) {
    print "Searching for motifs $regexp \n";
    print "Catalytic site is at location: @locations\n";
} else {
    print "motif not found \n\n";
}
exit;

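# read a file and return its lines
sub get_file_data {
    my ($filename) = @_;
    open(my $fh, '<', $filename) or die "Cannot open file \"$filename\": $!\n";
    my @filedata = <$fh>;
    close $fh;
    return @filedata;
}

# join the sequence lines, skipping blank lines, comments and FASTA headers
sub extract_sequence {
    my (@fasta_file_data) = @_;
    my $sequence = '';

    foreach my $line (@fasta_file_data) {
        if ($line =~ /^\s*(#.*)?$|^>/) {
            next;
        } else {
            $sequence .= $line;
        }
    }
    $sequence =~ s/\s//g;
    return $sequence;
}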

# return the 0-based start offset of every match of $regexp in $sequence
sub match_positions {
    my ($regexp, $sequence) = @_;
    my @position = ();
    while ($sequence =~ /$regexp/ig) {
        push(@position, $-[0]);
    }
    return @position;
}

I am not sure how to extend this to find multiple motifs in a fixed order (i.e., motif1, then motif2, then motif3) in a given file containing a protein sequence.

+1  A: 

You could simply use an alternation of the motif patterns (alternatives separated by |). That way the regex engine will match whichever of the motifs it can find.

/((?:$h){4}D(?:$x){4}D|(?:$x){1,4}A{1,2}(?:$s){2})/

Then you can test this match by looking at $1.
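
As a rough sketch of the alternation idea, the snippet below compiles two patterns with qr// and records which alternative matched at each position. The character classes come from the question and the second motif from the regex above; the test sequence is made up for illustration. Each class is wrapped in (?:...) because writing $h{4} directly inside a regex is parsed as a lookup in a hash %h rather than as a quantifier.

```perl
use strict;
use warnings;

# Character classes from the question.
my $h = '[VLIM]';
my $s = '[AG]';
my $x = '[ARNDCEQGHILKMFPSTWYV]';

# (?:...) keeps {4} a quantifier instead of a hash subscript.
my $motif1 = qr/(?:$h){4}D(?:$x){4}D/;          # hhhhDxxxxD, from the question
my $motif2 = qr/(?:$x){1,4}A{1,2}(?:$s){2}/;    # second alternative, from this answer

# Hypothetical sequence, invented for this example.
my $seq = 'KKVLIMDACDEDKKKTTTTTAAGG';

# Each hit is [ name of the alternative, 0-based start offset, matched text ].
my @hits;
while ($seq =~ /($motif1)|($motif2)/g) {
    my $which = defined $1 ? 'motif1' : 'motif2';
    push @hits, [ $which, $-[0], substr($seq, $-[0], $+[0] - $-[0]) ];
}

printf "%s at offset %d: %s\n", @$_ for @hits;
```

Note that an alternation reports matches in whatever order they occur in the sequence; it does not by itself enforce motif1 before motif2.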

Axeman
A: 

If you want to find these motifs in a particular order but perhaps separated somewhat, you could use something like:

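/(?:$h){4}D(?:$x){4}D .* (?:$s){4}D(?:$q){4}/x

(/x lets you add whitespace to the regex for readability, and .* matches zero or more characters. $q here stands for whatever additional character class your next motif needs, defined the same way as $h and $x. The (?:...) groups keep {4} a quantifier; a bare $h{4} would be read as a hash element.)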
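
A minimal sketch of the fixed-order search, assuming three motifs: the first is the one from the question, the second is borrowed from the alternation answer above, and the third is purely hypothetical. The helper name ordered_positions and the test sequence are also invented for illustration. It joins the motifs with a non-greedy .*? and reads the start offset of each capture group from @-.

```perl
use strict;
use warnings;

my $h = '[VLIM]';
my $s = '[AG]';
my $x = '[ARNDCEQGHILKMFPSTWYV]';

my @motifs = (
    qr/(?:$h){4}D(?:$x){4}D/,          # motif1: hhhhDxxxxD (from the question)
    qr/(?:$x){1,4}A{1,2}(?:$s){2}/,    # motif2: from the alternation answer
    qr/(?:$s){4}D/,                    # motif3: hypothetical, illustration only
);

# Build one pattern of the form (motif1).*?(motif2).*?(motif3).
my $joined  = join '.*?', map { "($_)" } @motifs;
my $ordered = qr/$joined/;

# Return the 0-based start offset of each motif, or an empty list
# if the motifs do not all occur in the required order.
sub ordered_positions {
    my ($seq) = @_;
    return unless $seq =~ $ordered;
    # $-[0] is the start of the whole match; $-[1], $-[2], ... are the groups.
    return map { $-[$_] } 1 .. scalar @motifs;
}

my $seq    = 'KKVLIMDACDEDKKKTTTTTAAGGKKGAGADKK';    # hypothetical sequence
my @starts = ordered_positions($seq);
print @starts ? "motifs start at offsets: @starts\n"
              : "motifs not found in order\n";
```

Because the motifs live in an array, adding a fourth motif only means appending another qr// entry. This reports the first ordered occurrence only; finding every combination would need a different approach.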

bdonlan
A: 

Are you just looking for substrings? If so, a couple of regexes will probably get you where you need to go. But these kinds of problems tend to escalate quickly, most likely in next week's problem set. If that happens and you need to compare sequences, you should start looking into dynamic-programming alignment algorithms, minimum edit distance, Viterbi alignment, HMMs and the like.

Also, if you are dealing with large input files, you might look into pre-compiling your regexes with qr// for a nice speed boost (search for "perl pre-compiled regexes").
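
To illustrate pre-compiling, here is a small sketch that builds the motif once with qr// and reuses the compiled object for every sequence. The three sequences are invented stand-ins for records read from a large file, and whether this is measurably faster will depend on your Perl version and your data.

```perl
use strict;
use warnings;

my $h = '[VLIM]';
my $x = '[ARNDCEQGHILKMFPSTWYV]';

# Compile the pattern once, outside the loop.
my $motif = qr/(?:$h){4}D(?:$x){4}D/;

# Hypothetical sequences standing in for records from a large input file.
my @sequences = ('KKVLIMDACDEDKK', 'TTTTTTTT', 'MMMMDAAAADVVVVDGGGGD');

# Map each sequence index to the list of 0-based match offsets.
my %positions;
for my $i (0 .. $#sequences) {
    my @found;
    while ($sequences[$i] =~ /$motif/g) {
        push @found, $-[0];
    }
    $positions{$i} = \@found;
}

for my $i (sort { $a <=> $b } keys %positions) {
    my @found = @{ $positions{$i} };
    print "sequence $i: ", (@found ? "@found" : "no match"), "\n";
}
```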

blackkettle
No, he's not just looking for substrings. Take a look at his regex classes.
Axeman