views:

1441

answers:

5

I have a string of arbitrary length, and starting at position p0, I need to find the first occurrence of one of three 3-letter patterns.

Assume the string contain only letters. I need to find the count of triplets starting at position p0 and jumping forward in triplets until the first occurrence of either 'aaa' or 'bbb' or 'ccc'.

Is this even possible using just a regex?

+12  A: 
$string=~/^   # from the start of the string
            (?:.{$p0}) # skip (don't capture) "$p0" occurrences of any character
            (?:...)*?  # skip 3 characters at a time,
                       # as few times as possible (non-greedy)
            (aaa|bbb|ccc) # capture aaa or bbb or ccc as $1
         /x;

(Assuming p0 is 0-based).

Of course, it's probably more efficient to use substr on the string to skip forward:

substr($string, $p0)=~/^(?:...)*?(aaa|bbb|ccc)/;
Mike G.
will $1 contain the count of triplets?
slashmais
(pls ignore my comment above) good one, but I need the count of triplets skipped - or maybe now that I know the first one, I can find it's pos and just use the difference from p0 as the count.
slashmais
No, $1 will contain the matched triplet (aaa or bbb or ccc). But you can use $-[1] to access its starting position
moritz
If you need the count of starting triplets, just remove the ?: from the triplets parens to make them capturing. Then length($1)/3 is the number of skipped triplets. But brian d foy's answer looks better...
Mike G.
+9  A: 

You can't really count with regexes, but you can do something like this:

pos $string = $start_from;
$string =~ m/\G         # anchor to previous pos()
            ((?:...)*?) # capture everything up to the match
            (aaa|bbb|ccc)
            /xs  or die "No match"
my $result = length($1) / 3;

But I think it's a bit faster to use substr() and unpack() to split into triple and walk the triples in a for-loop.

(edit: it's length(), not lenght() ;-)

moritz
+11  A: 

Moritz says this might be faster than a regex. Even if it's a little slower, it's easier to understand at 5 am. :)

             #0123456789.123456789.123456789.  
my $string = "alsdhfaaasccclaaaagalkfgblkgbklfs";  
my $pos    = 9;  
my $length = 3;  
my $regex  = qr/^(aaa|bbb|ccc)/;

while( $pos < length $string )    
    {  
    print "Checking $pos\n";  

    if( substr( $string, $pos, $length ) =~ /$regex/ )
        {
        print "Found $1 at $pos\n";
        last;
        }

    $pos += $length;
    }
brian d foy
yes - I should write KISS on a poster stick it on the wall above my monitor! Thanx.
slashmais
A: 

The main part of this is split /(...)/. But at the end of this, you'll have your positions and occurrence data.

my @expected_triplets = qw<aaa bbb ccc>;
my $data_string      
    = 'fjeidoaaaivtrxxcccfznaaauitbbbfzjasdjfncccftjtjqznnjgjaaajeitjgbbblafjan'
    ;
my $place          = 0;
my @triplets       = grep { length } split /(...)/, $data_string;
my %occurrence_for = map { $_, [] } @expected_triplets;
foreach my $i ( 0..@triplets ) {
    my $triplet = $triplets[$i];
    push( @{$occurrence_for{$triplet}}, $i ) if exists $occurrence_for{$triplet};
}

Or for simple counting by regex (it uses Experimental (??{}))

my ( $count, %count );
my $data_string      
    = 'fjeidoaaaivtrxxcccfznaaauitbbbfzjasdjfncccftjtjqznnjgjaaajeitjgbbblafjan'
    ;
$data_string =~ m/(aaa|bbb|ccc)(??{ $count++; $count{$^N}++ })/g;
Axeman
A: 

If speed is a serious concern, you can, depending on what the 3 strings are, get really fancy by creating a tree (e.g. Aho-Corasick algorithm or similar).

A map for every possible state is possible, e.g. state[0]['a'] = 0 if no strings begin with 'a'.

Brian