I have a selection of text that looks like the following. I need to do a rudimentary edit on it, but can't fathom the regex that I need. Maybe it's just been a long day, and I'm not seeing what I need.

Sample data:

START ITEM = 1235
    BEGIN
        WORD
        RATE = 98
        MORE WORDS
        CODE = XX
        STUFF
    END
    BEGIN
        TEXT
        MORE WORDS
        RATE = 57
        ADDITIONAL TEXT
        CODE = YY
        OTHER THINGS
    END
STOP
START ITEM = 9983
    BEGIN
        WORD
        RATE = 01
        MORE WORDS
        CODE = AA
        STUFF
    END
    BEGIN
        TEXT
        MORE WORDS
        RATE = 99
        ADDITIONAL TEXT
        CODE = XX
        OTHER THINGS
    END
STOP

I'm given a CODE and an ITEM number, and need to edit the rate in the appropriate BEGIN/END section. Fortunately, the sections are well-defined with STOP/START BEGIN/END (they're keywords, and aren't anywhere else).

My toolbox for this is Perl regular expressions.*

The first solution I tried didn't work (values hard-coded):

    $tx =~ s/(START \s ITEM \s = \s 9983.*?
                            BEGIN
                                .*?
                               RATE \s = \s )\d+
                                    (.*?       # Goes too far
                                CODE \s = \s XX)
                        /$1$newRate$2
                        /sx;

It fails because the indicated `.*?` winds up matching too much: it finds the correct CODE farther down, but the RATE that gets edited is always the first one in the item.
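A stripped-down reproduction may make the effect easier to see. The one-line string below is a made-up stand-in for two BEGIN/END blocks, not the real data:

```perl
use strict;
use warnings;

# Made-up one-line stand-in for two BEGIN...END blocks.
my $text = "RATE = 01 CODE = AA END RATE = 99 CODE = XX END";

# The match starts at the leftmost RATE. The lazy .*? then grows until
# CODE = XX finally appears, crossing the first END on the way.
my ($rate) = $text =~ /RATE = (\d+).*?CODE = XX/s;

print "captured rate: $rate\n";   # prints "captured rate: 01"
```

Lazy only means "as short as possible from this starting point"; it does nothing to stop the match from running past END.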

Suggestions?


* The actual code relies on the regex being added onto a stack of regexes (sort of a post-processing filter) that are each applied in turn to the text to do edits. Heck, I could do a full-on parser if I had the text. But I was hoping not to have to break that code open and stick with the API I've got.

+4  A: 

Although I don't like how much it backtracks, making the catchall greedy between BEGIN and RATE will allow you to skip to the RATE in the section where CODE=XX. Like this:

$tx = qr/(START \s+ ITEM \s+ = \s+ 9983 \s+ 
                        BEGIN
                            .*
                           RATE \s+ = \s+ )\d+
...

The main problem with this is that it will jump into another ITEM if necessary, so you have to make sure you don't gobble up STOP. Like so:

my $tx = qr/(START \s+ ITEM \s+ = \s+ 9983 \s+
                 BEGIN
                     (?: (?! \b STOP \b ) . )*
                    RATE \s+ = \s+ )\d+
                         (.*?       # Goes too far
                     CODE \s+ = \s+ XX)
          /msx
          ;

It still backtracks more than I'd like.

(An hour later) I realized that the RATE and the CODE field whose value is XX must not be divided by an END. Thus another solution is:

my $tx = qr/(START \s+ ITEM \s+ = \s+ 9983 \s+
                 BEGIN
                     .*?
                    RATE \s+ = \s+ )\d+
                         ((?:(?! ^ \s+ END \s* $ ) . )*? 
                     CODE \s+ = \s+ XX)
                        /msx
                        ;

(I revised this to look only for END on a line by itself. If ADDITIONAL TEXT could contain a lone END, it would be hard to parse no matter what.)

I'm thinking this one doesn't backtrack as much, because it just starts from RATE = and then scans for CODE = before it hits END. If it doesn't find CODE = XX, it backs up to the position where it thought it matched RATE and goes looking for the next RATE. We could add the negative lookahead for STOP if we don't know that item 9983 is definitely going to have a code of 'XX'.
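As a rough sketch of how this might look as a complete substitution, with the STOP lookahead included: the `set_rate` helper, its parameters, and the trimmed sample data are assumptions made for illustration, and the pattern is a variation on the one above rather than a tested drop-in.

```perl
use strict;
use warnings;

# Hypothetical helper: the name set_rate and its parameters are illustrative.
sub set_rate {
    my ($text, $item, $code, $new_rate) = @_;
    $text =~ s{
        ( ^ START \s+ ITEM \s+ = \s+ \Q$item\E \s+
          BEGIN
          (?: (?! ^ STOP \b ) . )*?            # never leave this ITEM
          RATE \s+ = \s+ )
        \d+
        ( (?: (?! ^ \s+ END \s* $ ) . )*?      # never leave this BEGIN block
          CODE \s+ = \s+ \Q$code\E \b )
    }{${1}${new_rate}${2}}msx;
    return $text;
}

# Trimmed version of the sample data (filler lines removed).
my $sample = <<'DATA';
START ITEM = 1235
    BEGIN
        RATE = 98
        CODE = XX
    END
    BEGIN
        RATE = 57
        CODE = YY
    END
STOP
START ITEM = 9983
    BEGIN
        RATE = 01
        CODE = AA
    END
    BEGIN
        RATE = 99
        CODE = XX
    END
STOP
DATA

# Changes RATE = 99 in the second block of item 9983 and nothing else.
print set_rate($sample, 9983, 'XX', 42);
```

With the STOP guard in place, asking for a code the item doesn't have (say item 1235 with code AA) should leave the text untouched instead of spilling into the next item.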


Edited to eliminate false \s problem.

Note: this now matches within the following section:

START ITEM = 9983
    BEGIN
        WORD
        RATE = 01
        MORE WORDS
        CODE = AA
        STUFF
    END
    BEGIN
        TEXT
        MORE WORDS
        RATE = 99
        ADDITIONAL TEXT <-- DON'T END HERE!
        CODE = XX
        OTHER THINGS
    END
STOP
Axeman
That wasn't the problem. The \s's got lost as I was transcribing into my browser from a non-internet connected system. Sorry.
clintp
@clintp: fixed it.
Axeman
@Axeman This greedy-with-negative-lookahead strategy is a good idea, but the specific solution fails for `1235` and `XX`. I think you can solve the problem by applying the approach more widely, using similar lookaheads for END.
FM
@FM: Funny, I didn't read your comment before updating my expression, but your suggestion is there, now. :)
Axeman
+6  A: 

A regex is poorly suited for this sort of problem. I recommend a simple iterative solution:

while (<FILE>) {
    # push lines straight to output until we find the START that we want
    print OUT $_;
    next unless m/START ITEM = $number\b/;

    # save the lines until we get to the CODE that we want
    my @lines;
    while (<FILE>)
    {
        push @lines, $_;
        last if m/CODE = \Q$code\E\b/;
    }

    # @lines now has everything from the START to the CODE. Find the last RATE
    # line in @lines and replace only the number on it.
    my ($idx) = grep { $lines[$_] =~ m/RATE = \d+/ } reverse 0 .. $#lines;
    $lines[$idx] =~ s/(RATE = )\d+/$1$new_value/ if defined $idx;

    # print out the lines we saved and exit the loop
    print OUT @lines;
    last;
}

# copy whatever follows the edited section through unchanged
print OUT $_ while <FILE>;

Edit: If you really want a regex, you can use something like this (untested):

$tx =~ s/(START \s+ ITEM \s+ = \s+ 9983.*?
                            BEGIN
                                .*?
                               RATE \s+ = \s+ )\d+
                                ( (?: (?! END ) . )*
                                    CODE \s+ = \s+ XX)
                        /$1$newRate$2
                        /sx;

The added (?: (?! END ) . )* ensures that the match between RATE and CODE doesn't cross an END. But this will be massively slower than a non-regex approach.
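One caveat with a bare `(?! END )`: it stops at the letters END anywhere, including inside a longer word. The small comparison below uses a made-up block to show the difference between the bare lookahead and one that only accepts END on a line of its own:

```perl
use strict;
use warnings;

# Hypothetical block whose free text happens to contain the letters END.
my $block = "RATE = 99\nDIDN'T ABEND\nCODE = XX\nEND\n";

# Bare lookahead: the scan halts at the END inside ABEND, so CODE is never reached.
my $bare = $block =~ / RATE \s+ = \s+ \d+ (?: (?! END ) . )* CODE \s+ = \s+ XX /sx;

# Whole-line lookahead: only a line consisting of END halts the scan.
my $line = $block =~ / RATE \s+ = \s+ \d+ (?: (?! ^ \s* END \s* $ ) . )* CODE \s+ = \s+ XX /msx;

print "bare: ", ($bare ? "match" : "no match"), "\n";
print "line: ", ($line ? "match" : "no match"), "\n";
```

If the free-text lines can never contain END as part of another word, the simpler form is enough.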

JSBangs
You got to it faster than I did. +1!
bobbymcr
I wasn't kidding when I said my toolbox had Perl regexes in it. The actual code relies on the regex being added onto a stack of regexes (sort of a callback regex) that are each applied in turn to the text. Heck, I could do a full-on parser if I had the text. But I was hoping not to have to break that code open and stick with the API I've got.
clintp
Sounds like clintp has a DailyWTF entry.
Brad Gilbert
@JSBangs: Speed isn't the issue, so this should work just dandy. Thanks.
clintp
@clintp: you should put those comments in your original question.
brian d foy
If `ADDITIONAL TEXT` contains the phrase "DIDN'T ABEND" or something like that, nothing will match. That's why I used word boundaries (and now an isolated word on a line).
Axeman
@briandfoy: comments added to question
clintp
A: 

Regular expressions are not always the best answer for parsing text. Your example shows that you really have a file that can be represented with a grammar. It will be much simpler to use a parser to extract the fields and then do the update on the extracted information.
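A minimal sketch of that idea, assuming the whole text is available as one string: walk it line by line, remember the fields of the current block, and decide at END whether to edit. The helper name, its parameters, and the trimmed sample are hypothetical, and this is a simple state machine rather than a full grammar-based parser.

```perl
use strict;
use warnings;

# Hypothetical helper: tracks the current ITEM and the fields of the current
# BEGIN...END block, then edits the block's RATE line once END is reached.
sub set_rate_by_parsing {
    my ($text, $item, $code, $new_rate) = @_;
    my @lines = split /^/m, $text;          # one element per line, newlines kept
    my $current_item = '';
    my ($rate_idx, $block_code);

    for my $i (0 .. $#lines) {
        my $line = $lines[$i];
        if ($line =~ /^START\s+ITEM\s*=\s*(\S+)/) {
            $current_item = $1;
        }
        elsif ($line =~ /^\s*BEGIN\s*$/) {
            ($rate_idx, $block_code) = (undef, undef);   # new block, forget old fields
        }
        elsif ($line =~ /^\s*RATE\s*=\s*\d+/) {
            $rate_idx = $i;
        }
        elsif ($line =~ /^\s*CODE\s*=\s*(\S+)/) {
            $block_code = $1;
        }
        elsif ($line =~ /^\s*END\s*$/) {
            # Block finished: edit it only if it is the one asked for.
            if ($current_item eq $item
                and defined $block_code and $block_code eq $code
                and defined $rate_idx) {
                $lines[$rate_idx] =~ s/(RATE\s*=\s*)\d+/${1}${new_rate}/;
            }
        }
    }
    return join '', @lines;
}

# Trimmed version of the sample data (filler lines removed).
my $sample = <<'DATA';
START ITEM = 1235
    BEGIN
        RATE = 98
        CODE = XX
    END
STOP
START ITEM = 9983
    BEGIN
        RATE = 01
        CODE = AA
    END
    BEGIN
        RATE = 99
        CODE = XX
    END
STOP
DATA

print set_rate_by_parsing($sample, 9983, 'XX', 42);
```

Because the decision is made at END, this version does not care whether RATE comes before or after CODE inside a block.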

David Harris