ansaurus

Question

Perl regex parse forward only; not end-to-start

Answer 1

+2 A:

Greg Bacon 2010-08-21 03:46:09

Thank you. Unfortunately, due to software constraints I can't use non-core modules on the production machines, but I tested this solution on a dev machine and it works beautifully, for anyone looking at this question in a normal environment.

WSkid 2010-08-21 06:33:25

@WSkid You're welcome. I'm glad you were able to get past the issue you were having.

Greg Bacon 2010-08-21 11:05:01

Answer 2

+1 A:

You could try a non-greedy match using .+? or .*? to keep it from slurping up the rest of the file.
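
As a rough illustration of the difference, here is a minimal sketch. The sample string and variable names are made up for the demo and are not taken from the asker's file:

```perl
use strict;
use warnings;

# Hypothetical sample input: two anchors on one line.
my $html = '<A href="#1">First</A> and <A href="#2">Second</A>';

# Greedy: .* runs to the LAST </A>, swallowing everything in between.
my ($greedy) = $html =~ m{<A href="#\d+">(.*)</A>};

# Non-greedy: .*? stops at the FIRST </A> that lets the match succeed.
my ($lazy) = $html =~ m{<A href="#\d+">(.*?)</A>};

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The greedy capture spans both anchors, while the non-greedy one holds only the first link text. If the text can contain line breaks, the /s modifier is also needed so that . matches newlines.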

Brian Phillips 2010-08-21 04:02:34

Ah, thank you - I knew I was overlooking something so simple!

WSkid 2010-08-21 06:31:24

Answer 3

+3 A:

You have to be careful with the regex when parsing HTML or similar structures. There are two issues with the regex you're trying:

Nested tags (font-tag in the first entry)
Line breaks (before the first closing anchor tag)

Here's a regex that deals with those:

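use HTML::Entities;    # non-core CPAN module; provides decode_entities()

# /g walks every match, /i ignores case, /s lets . match newlines
while ($string =~ m/<DIV style="margin-left:([0-9]+)px; text-indent:[-0-9]+px"><A href="#([0-9]+)">(.*?)<\/A>/gis) {
    my $indent = $1;
    my $page = $2;
    my $name = $3;
    $name =~ s/<.*?>//gs;    # strip nested tags such as <FONT>, even across line breaks
    $name =~ s/^\s+//;       # trim leading whitespace
    $name =~ s/\s+$//;       # trim trailing whitespace
    print $indent, '|', decode_entities($name), '|', $page, "\n";
}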
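
If HTML::Entities is not available (it is not a core module), something along these lines could stand in for decode_entities. This is only a sketch: decode_basic_entities and %ENTITY are hypothetical names, and it handles just a few named entities plus numeric references, not the full HTML entity table.

```perl
use strict;
use warnings;

# Hypothetical helper table: only a handful of named entities are covered.
my %ENTITY = (amp => '&', lt => '<', gt => '>', quot => '"', apos => "'");

sub decode_basic_entities {
    my ($text) = @_;
    # One pass over the string, so "&amp;lt;" becomes "&lt;" and is not decoded twice.
    $text =~ s{&(?:#x([0-9A-Fa-f]+)|#([0-9]+)|([A-Za-z]+));}{
        defined $1 ? chr(hex $1)            # hex reference, e.g. &#x41;
      : defined $2 ? chr($2)                # decimal reference, e.g. &#65;
      : exists $ENTITY{$3} ? $ENTITY{$3}    # known named entity
      : "&$3;"                              # unknown name: leave untouched
    }ge;
    return $text;
}

print decode_basic_entities('Tom &amp; Jerry &#38; &lt;friends&gt;'), "\n";
```

Unknown entity names are left as they are rather than guessed at, which seems the safer default for a partial table.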

jmz 2010-08-21 05:37:33

Thank you! This is a perfect example of a complete answer - I have to use my own HTML entities function due to the lack of external modules, but otherwise this was spot on!

WSkid 2010-08-21 06:32:06
