Parsing HTML with regex is a bad idea, but it seems suitable for this situation.
Description: given a .html file, I must parse the internal links and write each link's indent level, link text, and the page number it resides on to an external .txt file, which is then passed on to someone else.
So given this sample HTML:
<TR valign="bottom">
<TD valign="top"><DIV style="margin-left:0px; text-indent:-0px"><A href="#101"><FONT style="font-variant:small-caps;">The “Offering“</FONT>
</A></DIV></TD>
<TD> </TD>
<TD nowrap align="right" valign="top"> </TD>
<TD align="right" valign="top">1</TD>
<TD nowrap valign="top"> </TD>
</TR>
<TR valign="bottom">
<TD valign="top"><DIV style="margin-left:15px; text-indent:-0px"><A href="#102">Sales & Property
</A></DIV></TD>
<TD> </TD>
<TD nowrap align="right" valign="top"> </TD>
<TD align="right" valign="top">2</TD>
<TD nowrap valign="top"> </TD>
</TR>
The external file will contain:
0|The "Offering"|4
15|Sales & Property|5
(page numbers are different because they are the actual page number, not the folio reference).
I have this mostly figured out except for one part: when the text of the link contains extra HTML tags, like the <FONT> tag in the first link.
Here is my regex to extract the links (note $string contains the html above):
while ($string =~ m/<DIV style="margin-left:([0-9]+)px; text-indent:[-0-9]+px"><A href="#([0-9]+)">([a-zA-Z0-9\.,:;&#\s]+)<\/A>/gi) {
    push(@indents, $1);
    push(@linkIDs, $2);
    push(@names, escapeHTML($3));
}
That correctly extracts the second link but not the first, because the < and > and other symbols in the nested tag are not in the character class.
If I change that last capture group to .+ or .*, I get the entire HTML file (well, everything between the first <DIV><A> and the last </A>). It seems that the pattern starts at the beginning but matches from the end of the file backwards.
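The "matching from the end backwards" behaviour looks like ordinary greedy quantifier matching. Here is a minimal sketch of it on a made-up two-link string (not the real file). The /s flag is an assumption, added so that . can cross the newline before each </A>, and the variable names $greedy and @lazy are only for illustration:

```perl
use strict;
use warnings;

# Two links, each with a newline before </A>, loosely modeled on the sample rows.
my $string = qq{<A href="#101"><FONT>One</FONT>\n</A> <A href="#102">Two\n</A>};

# Greedy: .+ takes as much as it can, then backtracks only as far as the LAST </A>.
my ($greedy) = $string =~ m/<A href="#[0-9]+">(.+)<\/A>/si;

# Non-greedy: .+? grows one character at a time and stops at the FIRST </A>.
my @lazy = $string =~ m/<A href="#[0-9]+">(.+?)<\/A>/gsi;

print "greedy: $greedy\n";
print "lazy:   $_\n" for @lazy;
```

In this sketch the greedy capture runs from the first link's text through to the second link's text, while the non-greedy version yields two separate captures, one per link.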
Here is a link to an online regex builder: http://regexr.com?2s0po
It correctly finds what I need, but in Perl I do not get the same results (just the whole file as mentioned).
I can't seem to write anything that captures each group correctly. You would think the "cursor" would move forward and stop at the first </A> it saw from the start of the file.
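For reference, this is a rough sketch of the "stop at the first </A>" behaviour I am after, applied to a trimmed copy of the sample rows. It assumes a non-greedy (.*?) capture with the /s flag, strips nested tags afterwards, and prints the link ID in place of the page number, since the page lookup is not shown here. escapeHTML is left out to keep the sketch self-contained:

```perl
use strict;
use warnings;

# Trimmed copy of the sample rows; only the DIV/A cells are kept.
my $string = <<'HTML';
<TD valign="top"><DIV style="margin-left:0px; text-indent:-0px"><A href="#101"><FONT style="font-variant:small-caps;">The "Offering"</FONT>
</A></DIV></TD>
<TD valign="top"><DIV style="margin-left:15px; text-indent:-0px"><A href="#102">Sales & Property
</A></DIV></TD>
HTML

my (@indents, @linkIDs, @names);

while ($string =~ m/<DIV style="margin-left:([0-9]+)px; text-indent:[-0-9]+px"><A href="#([0-9]+)">(.*?)<\/A>/gis) {
    # Copy the captures first; the substitutions below reset $1..$3.
    my ($indent, $id, $text) = ($1, $2, $3);
    $text =~ s/<[^>]+>//g;      # drop nested tags such as <FONT ...>
    $text =~ s/^\s+|\s+$//g;    # trim the newline before </A>
    push @indents, $indent;
    push @linkIDs, $id;
    push @names,   $text;
}

# indent|text|link ID (the real output would use the page number instead)
print "$indents[$_]|$names[$_]|$linkIDs[$_]\n" for 0 .. $#names;
```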
Any help or opinions or guidance would be greatly appreciated. -Thank you.