ansaurus

Question

How can I identify the "tokens" (wrong word) of a regular expression?

Answer 1

The re pragma can produce the information you seem to be interested in.

use strict;
use warnings;
use re qw(Debug DUMP);

my $re = qr/square[\s-]*dance/;

'Let\'s go to the square dance!' =~ $re;

Output:

Compiling REx "square[\s-]*dance"
Final program:
   1: EXACT <square> (4)
   4: STAR (17)
   5:   ANYOF[\11\12\14\15 \-][+utf8::IsSpacePerl] (0)
  17: EXACT <dance> (20)
  20: END (0)
anchored "square" at 0 floating "dance" at 6..2147483647 (checking anchored) minlen 11 
Freeing REx: "square[\s-]*dance"

Unfortunately, there doesn't appear to be a programmatic hook to get this information. You'd have to intercept the output on STDERR and parse it. Rough proof-of-concept:

sub build_regexp {
    my $string = shift;
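    my $dump = '';    # start empty so later parsing never sees undef

    # Save the real STDERR, then redirect STDERR to an in-memory scalar
    open my $stderr, '>&', STDERR or die "Can't dup STDERR: $!";
    close STDERR;
    open STDERR, '>', \$dump or die "Can't redirect STDERR: $!";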

    # Compile regexp, capturing DUMP output in $dump
    my $re = do {
        use re qw(Debug DUMP);
        qr/$string/;
    };

    # Restore STDERR
    close STDERR;
    open STDERR, '>&', $stderr or die "Can't restore STDERR";

    # Parse DUMP output
    my @atoms = grep { /EXACT/ } split("\n", $dump);

    return $re, @atoms;
}

Use it this way:

my ($re, @atoms) = build_regexp('square[\s-]*dance');

$re contains the compiled pattern, and @atoms contains a list of the DUMP lines for the literal portions of the pattern. In this case, those are:

   1: EXACT <square> (4)
  17: EXACT <dance> (20)
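
If only the literal text is wanted rather than the whole DUMP line, the EXACT lines can be reduced further with a second regex. This is a rough sketch: extract_literals is a hypothetical helper name, the sample text is copied from the output above rather than captured live, and it assumes every literal node is printed as "offset: EXACT <text> (next)". That layout is debugging output, so it may differ between Perl versions.

```perl
use strict;
use warnings;

# Sample DUMP lines copied from the output above; in real use this would be
# the text captured from STDERR.
my $dump = <<'END_DUMP';
   1: EXACT <square> (4)
   4: STAR (17)
   5:   ANYOF[\11\12\14\15 \-][+utf8::IsSpacePerl] (0)
  17: EXACT <dance> (20)
  20: END (0)
END_DUMP

# Hypothetical helper (not part of the re pragma): return the text between
# the angle brackets of every EXACT-style node in a DUMP listing.
sub extract_literals {
    my ($text) = @_;
    my @literals;
    for my $line (split /\n/, $text) {
        # Assumed line layout: "<offset>: EXACT <literal> (<next>)".
        # \w* also accepts variants such as EXACTF (case-insensitive nodes).
        push @literals, $1
            if $line =~ /^\s*\d+:\s+EXACT\w*\s+<(.*)>\s+\(\d+\)\s*$/;
    }
    return @literals;
}

my @literals = extract_literals($dump);
print join(', ', @literals), "\n";    # prints "square, dance"
```

The same call could take the $dump captured inside build_regexp, so that the function returns the bare strings instead of the raw lines.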

Michael Carman 2010-05-10 20:01:18

Too bad about it needing to redirect STDERR to get at the data, but that's a spectacular solution. Thank you!

Trueblood 2010-05-10 20:31:50
