I am trying to extract the content of a date element from many ill-formed SGML documents. For instance, a document can contain a simple date element like
<DATE>4th July 1936</DATE>
or
<DATE blaAttrib="89787adjd98d9">4th July 1936</DATE>
but it can also be as hairy as:
<DATE blaAttrib="89787adjd98d9">4th July 1936
<EM>spanned across multiple lines and EM element inside DATE</EM></DATE>
The aim is to get "4th July 1936". Since the files are not big, I chose to read the whole content into a variable and run a regex against it. The following is a snippet of my Perl code:
{
    # Slurp mode: read the whole file in one go.
    local $/ = undef;
    open my $fh, '<', $file or die "Couldn't open file: $!";
    $fileContent = <$fh>;
    close $fh;
    if ( $fileContent =~ m/<DATE(.*)>(.*)<\/DATE>/ )
    {
        # $2 should contain "4th July 1936", but it does not.
    }
}
Unfortunately, the regex does not work for the hairy example. I believe this is because the <DATE> element contains an <EM> element and its content spans multiple lines.
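To make the problem easier to reproduce, here is a minimal self-contained sketch that runs the same pattern against two inline strings instead of real files. The names $simple, $hairy, and date_capture are made up for this illustration and are not part of my actual script; the strings are just the examples from above.

```perl
use strict;
use warnings;

# Hypothetical inline stand-ins for the file contents (nothing is read from disk).
my $simple = qq{<DATE blaAttrib="89787adjd98d9">4th July 1936</DATE>\n};
my $hairy  = qq{<DATE blaAttrib="89787adjd98d9">4th July 1936\n}
           . qq{<EM>spanned across multiple lines and EM element inside DATE</EM></DATE>\n};

# Same pattern as in my snippet; returns the second capture, or undef if nothing matched.
sub date_capture {
    my ($text) = @_;
    return $text =~ m/<DATE(.*)>(.*)<\/DATE>/ ? $2 : undef;
}

for my $case ( [ simple => $simple ], [ hairy => $hairy ] ) {
    my ( $name, $text ) = @$case;
    my $got = date_capture($text);
    print "$name: ", ( defined $got ? "matched, \$2 = '$got'" : "no match" ), "\n";
}
```

With these inputs the single-line case matches and gives "4th July 1936" in $2, while the multi-line case does not match at all, which is the behaviour I see with my real files.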
Can any kind soul give me some pointers, directions, or clues?
Thanks heaps!