tags:

views:

246

answers:

5

The following is the script for finding consecutive substrings in strings.

use strict;
use warnings;

my $file="Sample.txt";
open(DAT, $file) || die("Could not open file!");

#worry about these later
#my $regexp1 = "motif1";
#my $regexp2 = "motif2";
#my $regexp3 = "motif3";
#my $regexp4 = "motif4";

my $sequence;

while (my $line = <DAT>) {
    if ($line=~ /(HDWFLSFKD)/g){
        {
        print "its found index location: ",
        pos($line), "-",  pos($line)+length($1), "\n";        
        }
        if ($line=~ /(HD)/g){
                print "motif found and its locations is: \n";
                pos($line), "-", pos($line)+length($1), "\n\n";
                }
                if ($line=~ /(K)/g){
                        print "motif found and its location is: \n";
                        pos($line), "-",pos($line)+length($1), "\n\n";
                        }
                        if ($line=~ /(DD)/g){
                                print "motif found and its location is: \n";
                                pos($line), "-", pos($line)+length($1), "\n\n";
                                }
}else {
        $sequence .= $line;
        print "came in else\n";
    }
}

It matches substring1 with string and prints out position where substring1 matched. The problem lies in finding the rest of the substrings. For substrings2 it starts again from the beginning of the string (instead of starting from the position where substring1 was found). The problem is that every time it calculates position it starts from the beginning of string instead of starting from the position of the previously found substring. Since substrings are consecutive substring1, substring2, substring3, substring4, their positions have to occur after the previous respectively.

+1  A: 

You really should read

You need the special variables @- and @+ if you need the positions. No need to try to compute them yourself.

#!/usr/bin/perl

use strict;
use warnings;

use List::MoreUtils qw( each_array );

my $source = 'AAAA   BBCCC   DD  E      FFFFF';
my $pattern = join '\s*', map { "($_+)" } qw( A B C D E F );



if ( $source =~ /$pattern/ ) {
    my $it = each_array @-, @+;

    $it->(); # discard overall match information;

    while ( my ($start, $end) = $it->() ) {
        printf "Start: %d - Length: %d\n", $start, $end - $start;
    }
}

Start: 0 - Length: 4
Start: 7 - Length: 2
Start: 9 - Length: 3
Start: 15 - Length: 2
Start: 19 - Length: 1
Start: 26 - Length: 5
Sinan Ünür
A: 

The result of a construct like

$line=~ /(HD)/g

is a list. Use while to step through the hits.

fgm
The OP needs the start positions as well as the lengths.
Sinan Ünür
A: 

To match where the last match left off, use \G. perldoc perlre says (but consult your own installation's version's manual first):

The "\G" assertion can be used to chain global matches (using "m//g"), as described in "Regexp Quote-Like Operators" in perlop. It is also useful when writing "lex"-like scanners, when you have several patterns that you want to match against consequent substrings of your string, see the previous reference. The actual location where "\G" will match can also be influenced by using "pos()" as an lvalue: see "pos" in perlfunc. Note that the rule for zero-length matches is modified somewhat, in that contents to the left of "\G" is not counted when determining the length of the match. Thus the following will not match forever:

$str = 'ABC';
pos($str) = 1;
while (/.\G/g) {
    print $&;
}
Anonymous
Sinan Ünür
Probably to make an illustrative point. Take it up with the perlre author.
Anonymous
+1  A: 

I'm not a perl expert but you can use $- and $+ to track index location for last regex match found.
Below is code built on top of your code that explains this.

use strict;
use warnings;


my $file="sample.txt";
open(DAT, $file) || die("Could not open file!");

open (OUTPUTFILE, '>data.txt');

my $sequence;
my $someVar = 0;
my $sequenceNums = 1;

my $motif1 = "(HDWFLSFKD)";
my $motif2 = "(HD)";
my $motif3 = "(K)";
my $motif4 = "(DD)";

while (my $line = <DAT>) 
{
    $someVar = 0;
    print "\nSequence $sequenceNums: $line\n";
    print OUTPUTFILE "\nSequence $sequenceNums: $line\n";
        if ($line=~ /$motif1/g)
        {
             &printStuff($sequenceNums, "motif1", $motif1, "$-[0]-$+[0]");
             $someVar = 1;
        }


        if ($line=~ /$motif2/g and $someVar == 1)
        {
                &printStuff($sequenceNums, "motif2", $motif2, "$-[0]-$+[0]");
                $someVar = 2;
        }

        if ($line=~ /$motif3/g and $someVar == 2)
        {
                &printStuff($sequenceNums, "motif3", $motif4, "$-[0]-$+[0]");
                $someVar = 3;
        }

        if ($line=~ /$motif4/g and $someVar == 3)
        {
                &printStuff($sequenceNums, "motif4", $motif4, "$-[0]-$+[0]");
        }

        else 
        {
            $sequence .= $line;

            if ($someVar == 0)
            {
             &printWrongStuff($sequenceNums, "motif1", $motif1);
            }
            elsif ($someVar == 1)
            {
      &printWrongStuff($sequenceNums, "motif2", $motif2);
            }
            elsif ($someVar == 2)
            {
      &printWrongStuff($sequenceNums, "motif3", $motif3);
            }
            elsif ($someVar == 3)
            {
      &printWrongStuff($sequenceNums, "motif4", $motif4);
            }
        }
        $sequenceNums++;
}

sub printStuff
{
         print "Sequence: $_[0] $_[1]: $_[2] index location: $_[3] \n";
         print OUTPUTFILE "Sequence: $_[0]  $_[1]: $_[2] index location: $_[3]\n";
}

sub printWrongStuff
{
         print "Sequence: $_[0] $_[1]: $_[2] was not found\n";
         print OUTPUTFILE "Sequence: $_[0] $_[1]: $_[2] was not found\n"; 

}

close (OUTPUTFILE);
close (DAT);

Sample input:

MLTSHQKKF*HDWFLSFKD*SNNYN*HD*S*K*QNHSIK*DD*IFNRFNHYIYNDLGIRTIA MLTSHQKKFSNNYNSKQNHSIKDIFNRFNHYIYNDLGIRTIA MLTSHQKKFSNNYNSK*HDWFLSFKD*QNHSIKDIFNRFNHYIYNDL

Omnipresent
+2  A: 

Try this perl program

use strict;
use warnings;
use feature qw'say';

my $file="Sample.txt";
open( my $dat, '<', $file) || die("Could not open file!");

my @regex = qw(
  HDWFLSFKD
  HD
  K
  DD
);

my $sequence;

while( my $line = <$dat> ){
  chomp $line;

  say 'Line: ', $.;

  # reset the position of variable $line
  # pos is an lvalue subroutine
  pos $line = 0;

  for my $regex ( @regex ){
    $regex = quotemeta $regex;

    if( scalar $line =~ / \G (.*?) ($regex) /xg ){
      say $regex, ' found at location (', $-[2], '-', $+[2], ')';
      if( $1 ){
        say "    but skipped: \"$1\" at location ($-[1]-$+[1])";
      }
    }else{
      say 'Unable to find ', $regex;

      # end loop
      last;
    }
  }
}
Brad Gilbert
that is awesome and much less lines of code! gah I'm starting to love perl!
Omnipresent
It is also easy to add more regexes, if needed.
Brad Gilbert
it would help me (and maybe to OP) if you added some comments explaining what is going on. it looks great but since atleast I'm a n00b to perl I dont get some of the lines..e.g. use feature qw'say';
Omnipresent
http://perldoc.perl.org/feature.html
Brad Gilbert