views:

210

answers:

2

I have written the following script to search for a motif(substring) in a protein sequences(strings). I am beginner and writing this has been tough for me. I have two questions regarding the same: 1. Errors: The following script has few errors. I have been at it for quite sometime now but have not figured out what and why? 2. The following script has been written to search for one motif(substring) in protein sequences(strings). My next task involves searching for multiple motifs in a specific order (ex: motif1 motif2 motif 3 motif4 this order cannot be changed) in the same protein sequences(strings)

        use strict;
        use warnings;

        my @file_data=();
        my $motif ='';
        my $protein_seq='';
        my $h= '[VLIM]';   
        my $s= '[AG]';
        my $x= '[ARNDCEQGHILKMFPSTWYV]';
        my $regexp = "($h){4}D($x){4}D"; #motif to be searched is hhhhDxxxxD
        my @locations=();

        @file_data= get_file_data("seq.txt");

        $protein_seq= extract_sequence(@file_data); 

    #searching for a motif hhhhDxxxxD in each protein sequence in the give file

        foreach my $line(@file_data){
        if ($motif=~ /$regexp/){
        print "found motif \n\n";
        }
        else {
        print "not found \n\n";
        }
        }
#recording the location/position of motif to be outputed

        @locations= match_position($regexp,$seq);
        if (@locations){ 
        print "Searching for motifs $regexp \n";
        print "Catalytic site is at location:\n";
        }
        else{
        print "motif not found \n\n";
        }
        exit;

        sub get_file_data{
        my ($filename)=@_;
        use strict;
        use warnings;
        my $sequence='';

        foreach my $line(@file_data){

        if ($line=~ /^\s*$/){
        next;
                }
        elsif ($line=~ /^\s*#/){
        next;
        }
        elsif ($line=~ /^>/){
        next;
        }
        else {
        $sequence.=$line;
        }
        }
        $sequence=~ s/\s//g;
        return $sequence;
        }

        sub(match_positions) {
        my ($regexp, $sequence)=@_;
        use strict;
        my @position=();
        while ($sequence=~ /$regexp/ig){
        push (@position, $-[0]);
        }
        return @position;
        }
+2  A: 
  1. First of all, the keyword is elsif, second of all you don't need it. You can compress the code in the get_file_data loop to:

    next if $line =~ /^\s*$|^>/; 
    $sequence .= $line;
    

    As long as you're going to use regular expressions -- unless too unwieldy -- you might as well search for all the cases that you want to ignore. If you find that actual second case, you can add it as an another alternation. Say you wanted to exclude lines that begin with #-. Then you would just add it in like so: /^\s*$|^>|^#-/

  2. Another thing is that my position=(); needs to have the @ sigil, before position, or otherwise, perl thinks you're trying to something tricky with a call to position().

  3. You need the following changes:

     my $h= '[VLIM]';   
     my $s= '[AG]';
     my $x= '[ARNDCEQGHILKMFPSTWYV]';
    

    Otherwise, you're just assigning to $h to an array reference with a single slot populated by whatever would be returned from the sub VLIM.

  4. Third, don't use $&. Replace pos($sequence)-length($&)+1

    push @positions, $-[0];
    

    or better yet, use English:

    use English qw<-no_match_vars>;
    ...
    push @positions, $LAST_MATCH_START[0];
    
  5. I would suggest the following for the file reading:

    use IO::File;
    ...
    # Use real file handles
    my $fh = IO::File->new( "<seq.txt" );
    get_file_data( $fh ); # They can be passed
    ...
    sub get_file_data{
        my $file_handle = shift; 
        ...
        # while loop conserves resources
        while ( my $line = <$file_handle> ) { 
            next if $line =~ /^\s*$|^>/;
            $sequence .= $line;
        }
    
  6. A suggestion for going forward -- it helps me immensely:

    A. Install Smart::Comments

    B. Put this at the top of your script:

     use Smart::Comments;
    

    C. Every time you're not sure what you've got so far, like if you wanted to see the current contents of $sequence, place the following in the code:

    ### $sequence
    exit 0;
    

    just show it and exit. When you get too many printouts, delete them.

Axeman
@axeman: changes made.
shubster
A: 
  • Use "elsif" instead of "elseif".
  • Are @file_data and @fasta_file_data supposed to be the same thing?

In match_positions:

  • Remove the parenthesis around the sub name.
  • Change "my position" to "my @position".
  • Change the pattern from /regexp/ig to /$regexp/ig.
Nathan Kitchen
@Nathan Kitchen: changes made
shubster
yes they are the same and I changed it.
shubster
I don't think they should be. @fasta... holds the lines from the input, but the other holds the other lines. You seem to passing get_file_data the name of the file, but then just jumping to the point where you've read it into an array. Please see my #5, for how to make that a reality.
Axeman