Parsing HTML with regex is a bad idea, but it seems suitable for this situation.

Description: Given a .html file, I must parse the internal links and extract each link's indent level, its text, and the page number it resides on. These are written to an external .txt file, which is then passed on to someone else.

So given this sample HTML:

<TR valign="bottom">
    <TD valign="top"><DIV style="margin-left:0px; text-indent:-0px"><A href="#101"><FONT style="font-variant:small-caps;">The &#147;Offering&#147;</FONT>
</A></DIV></TD>
    <TD>&nbsp;</TD>
    <TD nowrap align="right" valign="top">&nbsp;</TD>
    <TD align="right" valign="top">1</TD>
    <TD nowrap valign="top">&nbsp;</TD>
</TR>
<TR valign="bottom">
    <TD valign="top"><DIV style="margin-left:15px; text-indent:-0px"><A href="#102">Sales &#038; Property
</A></DIV></TD>
    <TD>&nbsp;</TD>
    <TD nowrap align="right" valign="top">&nbsp;</TD>
    <TD align="right" valign="top">2</TD>
    <TD nowrap valign="top">&nbsp;</TD>
</TR>

The external file will produce:

0|The "Offering"|4
15|Sales & Property|5

(page numbers are different because they are the actual page number, not the folio reference).

I have this mostly figured out except for one part: when the text of the link contains extra HTML tags, like the <FONT> tag in the first link.

Here is my regex to extract the links (note $string contains the html above):

while ($string =~ m/<DIV style="margin-left:([0-9]+)px; text-indent:[-0-9]+px"><A href="#([0-9]+)">([a-zA-Z0-9\.,:;&#\s]+)<\/A>/gi) {
    push(@indents,$1);
    push(@linkIDs,$2);
    push(@names,escapeHTML($3));
};

That will correctly extract the second one, but not the first, because the character class in the last capture group does not allow <, > and the other symbols in the nested HTML.

If I change that last capture group to .+ or .*, I get the entire HTML file (well, everything between the first <DIV><A> and the last </A>). It seems that the pattern is starting at the beginning, but matching from the end of the file backwards.

Here is a link to an online regex builder: http://regexr.com?2s0po
It correctly finds what I need, but in Perl I do not get the same results (just the whole file as mentioned).

I can't seem to write anything that will capture each group correctly - you would think the "cursor" would move forward and stop at the first </A> it saw from the start of the file.

Any help or opinions or guidance would be greatly appreciated. -Thank you.

+2  A: 
Greg Bacon
Thank you. Unfortunately, due to software constraints I'm unable to use non-core modules on the production machines, but I tested this solution on a dev machine and it works beautifully, for those looking at this question in a normal environment.
WSkid
@WSkid You're welcome. I'm glad you were able to get past the issue you were having.
Greg Bacon
+1  A: 

You could try a non-greedy match using .+? or .*? to keep it from slurping up the rest of the file.
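
As a minimal sketch of the difference (the sample string and variable names here are made up for illustration, not taken from the question's file): a greedy .+ runs to the end of the input and then backtracks to the last </A>, while .+? stops at the first one.

```perl
use strict;
use warnings;

# Two anchors on one line; sample data invented for this illustration.
my $html = '<A href="#101">First</A> filler <A href="#102">Second</A>';

# Greedy: .+ grabs as much as possible, then backs up to the LAST </A>.
my ($greedy) = $html =~ m/<A href="#\d+">(.+)<\/A>/;

# Non-greedy: .+? grows one character at a time and stops at the FIRST </A>.
my ($lazy) = $html =~ m/<A href="#\d+">(.+?)<\/A>/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

Note that . still does not match a newline unless the /s modifier is added, so link text that wraps onto a second line needs both changes.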

Brian Phillips
Ah, thank you - I knew I was overlooking something so simple!
WSkid
+3  A: 

You have to be careful with the regex when parsing HTML or similar structures. There are two issues with the regex you're trying:

  1. Nested tags (font-tag in the first entry)
  2. Line breaks (before the first closing anchor tag)

Here's a regex that deals with those:

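use HTML::Entities;
# (.*?) is non-greedy, so it stops at the first </A>; /s lets "." match newlines.
while ($string =~ m/<DIV style="margin-left:([0-9]+)px; text-indent:[-0-9]+px"><A href="#([0-9]+)">(.*?)<\/A>/gis) {
    my $indent = $1;
    my $page = $2;                  # the href anchor ID, not yet the real page number
    (my $name = $3) =~ s/\s+$//;    # copy the link text and trim trailing whitespace
    $name =~ s/^\s+//;              # trim leading whitespace
    $name =~ s/<.*?>//g;            # strip nested tags such as <FONT>
    print $indent, '|', decode_entities($name), '|', $page, "\n";
}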
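
To see why the /s modifier matters for the line-break issue, here is a minimal sketch that counts matches with and without it. The row is modeled on the first entry in the question (entities left out for brevity), and the variable names are made up for this example.

```perl
use strict;
use warnings;

# One row with a nested FONT tag and a newline before the closing </A>.
my $row = qq{<DIV style="margin-left:0px; text-indent:-0px"><A href="#101"><FONT style="font-variant:small-caps;">The Offering</FONT>\n</A></DIV>};

# Without /s, "." cannot match the newline, so the pattern never reaches </A>.
my $without_s = () = $row =~ m/<A href="#\d+">(.*?)<\/A>/gi;

# With /s, "." matches any character, newline included.
my $with_s = () = $row =~ m/<A href="#\d+">(.*?)<\/A>/gis;

print "matches without /s: $without_s\n";
print "matches with /s:    $with_s\n";
```

The original character class happened to get past the newline only because it included \s; once it is replaced by ., the /s flag has to take over that job.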
jmz
Thank you! This is a perfect example of a complete answer - I have to use my own HTML entities function due to the lack of external modules, but otherwise this was spot on!
WSkid
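
For readers in the same core-modules-only situation, here is a rough sketch of what a hand-rolled replacement for decode_entities might look like. The function name decode_basic_entities and both lookup tables are hypothetical; they cover only the entities seen in the sample, and treating &#147; and &#148; as Windows-1252 quotes flattened to a plain " is an assumption based on the expected output in the question.

```perl
use strict;
use warnings;

# Hypothetical lookup tables; extend them to whatever the real files contain.
# Assumption: &nbsp; is mapped to a plain space.
my %named  = (amp => '&', lt => '<', gt => '>', quot => '"', nbsp => ' ');
# Assumption: 147/148 are Windows-1252 curly quotes, flattened to '"'.
my %cp1252 = (147 => '"', 148 => '"');

sub decode_basic_entities {
    my ($text) = @_;
    # Single pass, so text produced by one replacement is never decoded again.
    $text =~ s{&(?:#(\d+)|(\w+));}{
        defined($1) ? (exists $cp1252{$1 + 0} ? $cp1252{$1 + 0} : chr($1))
                    : (exists $named{$2}      ? $named{$2}      : "&$2;")
    }ge;
    return $text;
}

print decode_basic_entities('The &#147;Offering&#147;'), "\n";
print decode_basic_entities('Sales &#038; Property'), "\n";
```

Unknown named entities are left untouched rather than guessed at, which makes gaps in the table easy to spot in the output file.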