tags:

views:

120

answers:

4

I would like a Perl regex that matches any contiguous subset of the string '12345'.

I'm probably just having a brain-freeze, but this is my test code and current best regex. I can see how to brute-force the situation by adding alternatives, but I'm wondering which elegant alternative I'm missing. [I don't specifically need captures for the digits; I have left the sample regex without non-capturing parentheses to make it slightly less cluttered.]

Test case:

use strict;
use warnings;

my @good = qw( 1 12 123 1234 12345 2 23 234 2345 3 34 345 4 45 5);
my @bad  = qw( 0 6 13 134 1345 145 15 124 1245 125 1235 24 245 25
               35 21 32 43 54 543 5432 54321);

my $qr = qr/^(1?(2?(3(4(5)?)?)?)?)$/;   # 3 'good', 3 'bad' failures
#my $qr = qr/^(1?(2(3(4(5)?)?)?)?)$/;   # 6 'good' failures.
my $fail = 0;

foreach my $opt (@good)
{
    printf "GOOD %d: $opt - missed by regex\n", ++$fail if ($opt !~ /$qr/);
}

foreach my $opt (@bad)
{
    printf "BAD %d: $opt - allowed by regex\n", ++$fail if ($opt =~ /$qr/);
}

print(($fail == 0) ? "PASS\n" : "FAIL\n");

Sample outputs:

Case 1 (commented out):

GOOD 1: 3 - missed by regex
GOOD 2: 34 - missed by regex
GOOD 3: 345 - missed by regex
GOOD 4: 4 - missed by regex
GOOD 5: 45 - missed by regex
GOOD 6: 5 - missed by regex
FAIL

Case 2 (active):

GOOD 1: 4 - missed by regex
GOOD 2: 45 - missed by regex
GOOD 3: 5 - missed by regex
BAD 4: 13 - allowed by regex
BAD 5: 134 - allowed by regex
BAD 6: 1345 - allowed by regex
FAIL

So, can you write a nice simple, symmetric regex that matches what I want and not what I don't?


This regex lets the test case pass, but isn't as elegant as I was hoping for:

my $qr = qr/^((1?(2(3(4(5)?)?)?)?)|(3?(4(5)?)?)|5)$/;

Test case with Justin's solution

use strict;
use warnings;

my @good = qw( 1 12 123 1234 12345 2 23 234 2345 3 34 345 4 45 5);
my @bad  = qw( 0 6 13 134 1345 145 15 124 1245 125 1235 24 245 25
               35 21 32 43 54 543 5432 54321 11 122 1233 1223 12234);

#my $qr = qr/^(1?(2?(3(4(5)?)?)?)?)$/;   # 3 'good', 3 'bad' failures
#my $qr = qr/^(1?(2(3(4(5)?)?)?)?)$/;    # 6 'good' failures.
#my $qr = qr/^((1?(2(3(4(5)?)?)?)?)|(3?(4(5)?)?)|5)$/;  # Passes

# Ysth's solution - passes
#my $qr = qr/^[12345](?:(?<=1)2|(?<=2)3|(?<=3)4|(?<=4)5)*$/;

my $fail = 0;

foreach my $opt (@good)
{
    printf "GOOD %d: $opt - missed by regex\n", ++$fail if ('12345' !~ /$opt/);
    #printf "GOOD %d: $opt - missed by regex\n", ++$fail if ($opt !~ /$qr/);
}

foreach my $opt (@bad)
{
    printf "BAD %d: $opt - allowed by regex\n", ++$fail if ('12345' =~ /$opt/);
    #printf "BAD %d: $opt - allowed by regex\n", ++$fail if ($opt =~ /$qr/);
}

print(($fail == 0) ? "PASS\n" : "FAIL\n");
+1  A: 
/^[12345](?:(?<=1)2|(?<=2)3|(?<=3)4|(?<=4)5)*\z/

Sorry, got it wrong twice. This should do it. The explicit list of all possible matches is going to be faster, though.

ysth
As amended, without the anchors, it fails on 25 of the test cases; with the anchors added ('`my $qr = qr/^([12345](?:(?<=1)2|(?<=2)4|(?<=3)4|(?<=4)5)*)$/;`'), it fails on 10 of the test cases.
Jonathan Leffler
The revised version looked for 24 instead of 23. I fixed it. It now passes all cases. I'm not sure it's better than listing all cases, though.
cjm
Agreed - it works now. Also agreed that the simple listing of all cases may just be the best way to deal with it.
Jonathan Leffler
Sorry, I intentionally answered the question, not the test code, and the question doesn't call for anchors. But $ is wrong.
ysth
A: 

Do you want something better than

/\b(1|12|123|1234|12345|2|23|234|2345|3|34|345|4|45|5)\b/

?

Gabe
Yes - I was vaguely hoping for something better than that...
Jonathan Leffler
ysth's solution looks pretty good. I'd like to see a performance comparison between his and the "default" one I posted.
Gabe
This has the merit of robustness and clarity. I'll probably leave the rather nasty set of shell `case` statements dealing with the situation, anyway, but it was an interesting thought exercise.
Jonathan Leffler
+11  A: 

Reverse the match:

'12345' =~ /$opt/
Justin
I kind of like this idea, but (depending on what you know about `$opt`), you may need to validate it first: `$opt =~ /^\d+$/ and '12345' =~ /$opt/`.
cjm
Bingo - there's the lateral thinking I was missing! Congratulations.
Jonathan Leffler
In the simple test cases, the only thing this scheme lets through that I would want to fail is an empty string - and in the context, I can deal with that separately. However, as @cjm points out, if the argument is itself a regex (so the user bothers to type '`^d+$`' on the command line), then it gets past. However, since the next step is working out which of the 5 phases of a complicated compiler should be executed, the regex would not execute phase 1, or phase 2, or ... and if they're devious enough to get something through that, I'm really not that worried - they'll get whatever they get.
Jonathan Leffler
@Jonathan, someone entering `1...5` might be a bit surprised. Depending on how the later processing goes, that might execute phase 1 followed by phase 5.
cjm
+7  A: 

Here's a revised version of Justin's idea:

index('12345', $opt) >= 0;

Or, if you need to exclude the empty string

index('12345', $opt) >= 0 and length $opt;

This way, you don't need to check $opt for regex metachars. I'm not sure which version woud be faster.

cjm
This also works. In the context I am looking at (argument parsing for a command which launches C compilers), efficiency is immaterial; neatness was the criterion. This is neat too.
Jonathan Leffler
Thanks, learned a new function!
Joel