ansaurus

Question

Find multiple substrings in strings and record location

Answer 1

+1 A:

You really should read

You need the special variables @- and @+ if you need the positions. No need to try to compute them yourself.

#!/usr/bin/perl

use strict;
use warnings;

use List::MoreUtils qw( each_array );

my $source = 'AAAA   BBCCC   DD  E      FFFFF';
my $pattern = join '\s*', map { "($_+)" } qw( A B C D E F );



if ( $source =~ /$pattern/ ) {
    my $it = each_array @-, @+;

    $it->(); # discard overall match information;

    while ( my ($start, $end) = $it->() ) {
        printf "Start: %d - Length: %d\n", $start, $end - $start;
    }
}

Start: 0 - Length: 4
Start: 7 - Length: 2
Start: 9 - Length: 3
Start: 15 - Length: 2
Start: 19 - Length: 1
Start: 26 - Length: 5

Sinan Ünür 2009-05-07 13:00:47

Answer 2

A:

The result of a construct like

$line=~ /(HD)/g

is a list. Use while to step through the hits.

fgm 2009-05-07 13:01:36

The OP needs the start positions as well as the lengths.

Sinan Ünür 2009-05-07 13:11:09

Answer 3

A:

To match where the last match left off, use \G. perldoc perlre says (but consult your own installation's version's manual first):

The "\G" assertion can be used to chain global matches (using "m//g"), as described in "Regexp Quote-Like Operators" in perlop. It is also useful when writing "lex"-like scanners, when you have several patterns that you want to match against consequent substrings of your string, see the previous reference. The actual location where "\G" will match can also be influenced by using "pos()" as an lvalue: see "pos" in perlfunc. Note that the rule for zero-length matches is modified somewhat, in that contents to the left of "\G" is not counted when determining the length of the match. Thus the following will not match forever:
$str = 'ABC';
pos($str) = 1;
while (/.\G/g) {
    print $&;
}

Anonymous 2009-05-07 13:13:26

Sinan Ünür 2009-05-07 13:19:34

Probably to make an illustrative point. Take it up with the perlre author.

Anonymous 2009-05-07 13:48:51

Answer 4

+1 A:

I'm not a perl expert but you can use $- and $+ to track index location for last regex match found.
Below is code built on top of your code that explains this.

use strict;
use warnings;


my $file="sample.txt";
open(DAT, $file) || die("Could not open file!");

open (OUTPUTFILE, '>data.txt');

my $sequence;
my $someVar = 0;
my $sequenceNums = 1;

my $motif1 = "(HDWFLSFKD)";
my $motif2 = "(HD)";
my $motif3 = "(K)";
my $motif4 = "(DD)";

while (my $line = <DAT>) 
{
    $someVar = 0;
    print "\nSequence $sequenceNums: $line\n";
    print OUTPUTFILE "\nSequence $sequenceNums: $line\n";
        if ($line=~ /$motif1/g)
        {
             &printStuff($sequenceNums, "motif1", $motif1, "$-[0]-$+[0]");
             $someVar = 1;
        }


        if ($line=~ /$motif2/g and $someVar == 1)
        {
                &printStuff($sequenceNums, "motif2", $motif2, "$-[0]-$+[0]");
                $someVar = 2;
        }

        if ($line=~ /$motif3/g and $someVar == 2)
        {
                &printStuff($sequenceNums, "motif3", $motif4, "$-[0]-$+[0]");
                $someVar = 3;
        }

        if ($line=~ /$motif4/g and $someVar == 3)
        {
                &printStuff($sequenceNums, "motif4", $motif4, "$-[0]-$+[0]");
        }

        else 
        {
            $sequence .= $line;

            if ($someVar == 0)
            {
             &printWrongStuff($sequenceNums, "motif1", $motif1);
            }
            elsif ($someVar == 1)
            {
      &printWrongStuff($sequenceNums, "motif2", $motif2);
            }
            elsif ($someVar == 2)
            {
      &printWrongStuff($sequenceNums, "motif3", $motif3);
            }
            elsif ($someVar == 3)
            {
      &printWrongStuff($sequenceNums, "motif4", $motif4);
            }
        }
        $sequenceNums++;
}

sub printStuff
{
         print "Sequence: $_[0] $_[1]: $_[2] index location: $_[3] \n";
         print OUTPUTFILE "Sequence: $_[0]  $_[1]: $_[2] index location: $_[3]\n";
}

sub printWrongStuff
{
         print "Sequence: $_[0] $_[1]: $_[2] was not found\n";
         print OUTPUTFILE "Sequence: $_[0] $_[1]: $_[2] was not found\n"; 

}

close (OUTPUTFILE);
close (DAT);

Sample input:

MLTSHQKKF*HDWFLSFKD*SNNYN*HD*S*K*QNHSIK*DD*IFNRFNHYIYNDLGIRTIA MLTSHQKKFSNNYNSKQNHSIKDIFNRFNHYIYNDLGIRTIA MLTSHQKKFSNNYNSK*HDWFLSFKD*QNHSIKDIFNRFNHYIYNDL

Omnipresent 2009-05-07 16:54:34

Answer 5

+2 A:

Try this perl program

use strict;
use warnings;
use feature qw'say';

my $file="Sample.txt";
open( my $dat, '<', $file) || die("Could not open file!");

my @regex = qw(
  HDWFLSFKD
  HD
  K
  DD
);

my $sequence;

while( my $line = <$dat> ){
  chomp $line;

  say 'Line: ', $.;

  # reset the position of variable $line
  # pos is an lvalue subroutine
  pos $line = 0;

  for my $regex ( @regex ){
    $regex = quotemeta $regex;

    if( scalar $line =~ / \G (.*?) ($regex) /xg ){
      say $regex, ' found at location (', $-[2], '-', $+[2], ')';
      if( $1 ){
        say "    but skipped: \"$1\" at location ($-[1]-$+[1])";
      }
    }else{
      say 'Unable to find ', $regex;

      # end loop
      last;
    }
  }
}

Brad Gilbert 2009-05-07 19:47:45

that is awesome and much less lines of code! gah I'm starting to love perl!

Omnipresent 2009-05-07 20:27:05

It is also easy to add more regexes, if needed.

Brad Gilbert 2009-05-07 21:20:10

it would help me (and maybe to OP) if you added some comments explaining what is going on. it looks great but since atleast I'm a n00b to perl I dont get some of the lines..e.g. use feature qw'say';

Omnipresent 2009-05-08 02:22:38

http://perldoc.perl.org/feature.html

Brad Gilbert 2009-05-08 13:17:14

ansaurus

tags:

views:

answers:

Find multiple substrings in strings and record location

Try this perl program

related questions