ansaurus

Question

How can I have a minimal match between two known tokens?

Answer 1

+4 A:

Although I don't like how much it backtracks, making the catchall greedy between BEGIN and RATE will allow you to skip to the RATE in the section where CODE=XX. Like this:

$tx = qr/(START \s+ ITEM \s+ = \s+ 9983 \s+ 
                        BEGIN
                            .*
                           RATE \s+ = \s+ )\d+
...

The main problem with this is that it will jump into another ITEM if necessary, so you have to make sure you don't gobble up STOP. Like so:

my $tx = qr/(START \s+ ITEM \s+ = \s+ 9983 \s+
                 BEGIN
                     (?: (?! \b STOP \b ) . )*
                    RATE \s+ = \s+ )\d+
                         (.*?       # Goes too far
                     CODE \s+ = \s+ XX)
          /msx
          ;

It still backtracks more than I'd like.
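
To make the STOP guard concrete, here is a small sketch on made-up input (the second item number 1234, the sample text, and the variable names are assumptions for illustration only). Item 9983 has no RATE at all, so the plain `.*` runs across STOP and picks up the RATE of the next item, while the guarded version refuses to match:

```perl
use strict;
use warnings;

# Hypothetical input: item 9983 has no RATE; the following item 1234 does.
my $text = <<'END_TEXT';
START ITEM = 9983
    BEGIN
        WORD
    END
STOP
START ITEM = 1234
    BEGIN
        RATE = 55
    END
STOP
END_TEXT

# Unguarded: .* may cross STOP and find a RATE in the next item.
my $unguarded = qr/START \s+ ITEM \s+ = \s+ 9983 \s+ BEGIN .* RATE \s+ = \s+ (\d+)/sx;

# Guarded: a character is consumed only if the word STOP does not start there.
my $guarded = qr/START \s+ ITEM \s+ = \s+ 9983 \s+ BEGIN
                 (?: (?! \b STOP \b ) . )*
                 RATE \s+ = \s+ (\d+)/sx;

my ($leaked)        = $text =~ $unguarded;
my $guarded_matches = $text =~ $guarded ? 1 : 0;

print "unguarded captured RATE $leaked\n";
print "guarded matched: $guarded_matches\n";
```

The guard costs a lookahead at every character, which is part of why this still backtracks more than one would like.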

(An hour later) I realized that the RATE and the CODE field whose value is XX must not be divided by an END. Thus another solution is:

my $tx = qr/(START \s+ ITEM \s+ = \s+ 9983 \s+
                 BEGIN
                     .*?
                    RATE \s+ = \s+ )\d+
                         ((?:(?! ^ \s+ END \s* $ ) . )*? 
                     CODE \s+ = \s+ XX)
                        /msx
                        ;

(I revised this to look for END only when it is alone on a line. If ADDITIONAL TEXT could contain a line consisting of just END, the input would be hard to parse no matter what.)

I think this one doesn't backtrack as much, because it starts from RATE = and scans for CODE = XX before it hits END. If it doesn't find CODE = XX, it backs up to the position where it matched RATE and goes looking for the next RATE. We could add the negative lookahead for STOP if we don't know that item 9983 is definitely going to have a code of 'XX'.

Edited to eliminate false \s problem.

Note: this now handles the following section:

START ITEM = 9983
    BEGIN
        WORD
        RATE = 01
        MORE WORDS
        CODE = AA
        STUFF
    END
    BEGIN
        TEXT
        MORE WORDS
        RATE = 99
        ADDITIONAL TEXT <-- DON'T END HERE!
        CODE = XX
        OTHER THINGS
    END
STOP
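
As a rough check, the END-guarded pattern can be run against exactly this section. The sketch below assumes a replacement value of 42 held in a hypothetical `$new_rate` variable, and writes `${1}` in the replacement so the capture can never be misread if a literal digit follows it:

```perl
use strict;
use warnings;

my $section = <<'END_SECTION';
START ITEM = 9983
    BEGIN
        WORD
        RATE = 01
        MORE WORDS
        CODE = AA
        STUFF
    END
    BEGIN
        TEXT
        MORE WORDS
        RATE = 99
        ADDITIONAL TEXT <-- DON'T END HERE!
        CODE = XX
        OTHER THINGS
    END
STOP
END_SECTION

# The pattern from above: between RATE and CODE = XX, never step onto a
# line that consists of END alone.
my $tx = qr/(START \s+ ITEM \s+ = \s+ 9983 \s+
                 BEGIN
                     .*?
                    RATE \s+ = \s+ )\d+
                         ((?:(?! ^ \s+ END \s* $ ) . )*?
                     CODE \s+ = \s+ XX)
           /msx;

my $new_rate = 42;    # hypothetical replacement value
(my $updated = $section) =~ s/$tx/${1}$new_rate$2/;
print $updated;
```

Only the RATE in the block that holds CODE = XX should change; the RATE = 01 in the first block and the END inside the ADDITIONAL TEXT line are left alone.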

Axeman 2009-09-01 20:25:25

That wasn't the problem. The \s's got lost as I was transcribing into my browser from a non-internet connected system. Sorry.

clintp 2009-09-01 20:31:39

@clintp: fixed it.

Axeman 2009-09-01 20:57:31

@Axeman This greedy-with-negative-lookahead strategy is a good idea, but the specific solution fails for `1235` and `XX`. I think you can solve the problem by applying the approach more widely, using similar lookaheads for END.

FM 2009-09-01 21:58:43

@FM: Funny, I didn't read your comment before updating my expression, but your suggestion is there, now. :)

Axeman 2009-09-01 22:00:40

Answer 2

+6 A:

A regex is poorly suited for this sort of problem. I recommend a simple iterative solution:

while (<FILE>) {
    # push lines straight to output until we find the START that we want
    print OUT $_;
    next unless m/START ITEM = $number/;

    # save the lines until we get to the CODE that we want
    my @lines;
    while (<FILE>)
    {
        push @lines, $_;
        last if m/CODE = $code/;
    }

    # @lines now has everything from the START to the CODE. Find the last RATE
    # line in @lines and change its value in place.
    my ($last_rate) = grep { $lines[$_] =~ /RATE/ } reverse 0 .. $#lines;
    $lines[$last_rate] =~ s/(RATE\s+=\s+)\d+/$1$new_value/ if defined $last_rate;

    # print out the lines we saved and exit the loop
    print OUT @lines;
    last;
}

Edit: If you really want a regex, you can use something like this (untested):

$tx =~ s/(START \s+ ITEM \s+ = \s+ 9983.*?
                            BEGIN
                                .*?
                               RATE \s+ = \s+ )\d+
                                ( (?: (?! END ) . )*
                                    CODE \s+ = \s+ XX)
                        /$1$newRate$2/sx;

The added (?: (?! END ) . )* ensures that the match between RATE and CODE doesn't cross an END. But this will be massively slower than a non-regex approach.

JSBangs 2009-09-01 20:25:51

You got to it faster than I did. +1!

bobbymcr 2009-09-01 20:26:41

I wasn't kidding when I said my toolbox had Perl regexes in it. The actual code relies on the regex being added onto a stack of regexes (sort of a callback regex) that are each applied in turn to the text. Heck, I could do a full-on parser if I had the text. But I was hoping not to have to break that code open and stick with the API I've got.

clintp 2009-09-01 20:36:49

Sounds like clintp has a DailyWTF entry.

Brad Gilbert 2009-09-02 02:28:33

@JSBangs: Speed isn't the issue, so this should work just dandy. Thanks.

clintp 2009-09-02 13:45:10

@clintp: you should put those comments in your original question.

brian d foy 2009-09-02 17:30:11

If `ADDITIONAL TEXT` contains the phrase "DIDN'T ABEND" or something like that, nothing will match. That's why I used word boundaries (and now an isolated word on a line).

Axeman 2009-09-03 16:59:30
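
The ABEND point can be illustrated with a short sketch. The block text and variable names below are made up for the example; the two patterns differ only in the word boundaries around END:

```perl
use strict;
use warnings;

# Hypothetical block whose free text has END buried inside another word.
my $block = "RATE = 99\nJOB DIDN'T ABEND\nCODE = XX\n";

# Bare guard: also stops at the END inside ABEND, so CODE is never reached.
my $bare = qr/RATE \s+ = \s+ \d+ (?: (?! END ) . )* CODE \s+ = \s+ XX/sx;

# Word-boundary guard: only a standalone END stops the scan.
my $bounded = qr/RATE \s+ = \s+ \d+ (?: (?! \b END \b ) . )* CODE \s+ = \s+ XX/sx;

my $bare_ok    = $block =~ $bare    ? 1 : 0;
my $bounded_ok = $block =~ $bounded ? 1 : 0;
print "bare: $bare_ok, bounded: $bounded_ok\n";
```

A standalone END in the free text would still defeat the word-boundary version, which is the reason for requiring END to be alone on its line.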

@briandfoy: comments added to question

clintp 2009-09-12 03:00:51

Answer 3

A:

Regular expressions are not always the best answer for parsing text. Your example shows that you really have a file that can be represented with a grammar. It will be much simpler to use a parser to extract the fields and then do the update on the extracted information.
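
A minimal sketch of that idea, assuming the layout seen in this thread (START ITEM = n, BEGIN/END blocks holding KEY = VALUE lines, STOP). The names `parse_items` and `set_rate` are hypothetical helpers, not part of any existing API, and the parser is line-oriented rather than a full grammar:

```perl
use strict;
use warnings;

# Hypothetical parser: returns a list of items, each { number, blocks },
# where every block maps a field name to { value, line } and "line" is the
# index of that field in @$lines.
sub parse_items {
    my ($lines) = @_;
    my (@items, $item, $block);
    for my $i (0 .. $#$lines) {
        my $line = $lines->[$i];
        if ($line =~ /^ \s* START \s+ ITEM \s+ = \s+ (\d+)/x) {
            $item = { number => $1, blocks => [] };
            push @items, $item;
        }
        elsif ($line =~ /^ \s* BEGIN \s* $/x) {
            $block = {};
            push @{ $item->{blocks} }, $block if $item;
        }
        elsif ($line =~ /^ \s* (END|STOP) \s* $/x) {
            undef $block;                      # either keyword closes the block
            undef $item if $1 eq 'STOP';       # STOP also closes the item
        }
        elsif ($block && $line =~ /^ \s* (\w+) \s+ = \s+ (\S+)/x) {
            $block->{$1} = { value => $2, line => $i };
        }
    }
    return @items;
}

# Set RATE in every block of item $number whose CODE equals $code.
# Modifies @$lines in place.
sub set_rate {
    my ($lines, $number, $code, $new_rate) = @_;
    for my $item (grep { $_->{number} eq $number } parse_items($lines)) {
        for my $block (@{ $item->{blocks} }) {
            next unless $block->{CODE} && $block->{CODE}{value} eq $code;
            next unless $block->{RATE};
            $lines->[ $block->{RATE}{line} ] =~ s/(RATE \s+ = \s+) \d+/$1$new_rate/x;
        }
    }
    return;
}
```

With the file read into `@lines`, a call such as `set_rate(\@lines, 9983, 'XX', 42)` would update the matching RATE line. Free-text lines like ADDITIONAL TEXT are ignored because they do not have the KEY = VALUE shape.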

David Harris 2009-09-01 20:34:36
