I am trying to extract the content of a date element from many ill-formed sgml documents. For instance, the document can contain a simple date element like

<DATE>4th July 1936</DATE>

or

<DATE blaAttrib="89787adjd98d9">4th July 1936</DATE>

but can also be as hairy as:

<DATE blaAttrib="89787adjd98d9">4th July 1936
<EM>spanned across multiple lines and EM element inside DATE</EM></DATE>

The aim is to get the "4th July 1936". Since the files are not big, I chose to read the whole content into a variable and do the regex. The following is the snippet of my Perl code:

{
    local $/ = undef;
    open FILE, "$file" or die "Couldn't open file: $!";
    $fileContent = <FILE>;
    close FILE;

    if ( $fileContent =~ m/<DATE(.*)>(.*)<\/DATE>/)
    {
        # $2 should contain the "4th July 1936" but it did not.
    }
}

Unfortunately the regex does not work for the hairy example. This is because inside the <DATE> there is an <EM> element and it also spans multiple lines.

Can any kind soul give me some pointers, directions, or clues?

Thanks heaps!

+2  A: 

Use an XML parser if you can.

But judging from your example, you could probably try

if ($fileContent =~ m/<DATE[^>]*>([^<]+)/) {
  # use $1 here
  # you may need to strip new lines
}
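
A minimal sketch of the "strip new lines" step, assuming the date text always sits directly after the opening tag. The helper name extract_date and the sample string are hypothetical, made up for illustration only:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text directly after the opening <DATE ...>
# tag, up to the next '<', with surrounding whitespace removed.
sub extract_date {
    my ($content) = @_;
    return unless $content =~ m/<DATE[^>]*>([^<]+)/;
    my $date = $1;
    $date =~ s/^\s+|\s+$//g;    # trim leading/trailing whitespace, including newlines
    return $date;
}

my $sample = qq{<DOC>\n<DATE blaAttrib="89787adjd98d9">4th July 1936\n<EM>more text</EM></DATE>\n</DOC>};
print extract_date($sample), "\n";
```

The helper returns nothing (undef in scalar context) when no DATE element is found, so the caller can skip such files.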
KennyTM
Hi Ken. Thanks for the regex, it certainly worked. The reason I did not use an XML parser is that there are about 20,000 SGML files I need to check, each about 50K in size. Parsing them all would be overkill, I think, and slow. I might be able to use a SAX-based parser, but I am not a Perl expert, so I am just trying to get this task done asap and move on.
Gilbeg
A: 

There isn't any way to match a regex across multiple lines, but you can use a little trick. If the files aren't too big, as you mentioned, you can first replace all '\n' characters with some placeholder (NEW_LINE or something like that), or delete them, and then apply your pattern.

Klark
There is. He's doing `local $/ = undef;` which does just that (well, it reads the whole file at once). Read up on Perl regexes in `perldoc perlre`.
MvanGeest
+3  A: 

Use an HTML parser.

Use an HTML parser.

Please, use an HTML parser.

But for a regex, I'd try

<DATE(.*?)>(.*)<\/DATE>

which should be faster than KennyTM's alternative... By the way, why are you capturing that second group?

MvanGeest
Downvote because the question states this isn't XML.
daxim
Ah, I hadn't noticed that. Still, there are some very resilient parsers around that can handle a huge mess.
MvanGeest
There are HTML parsers that would do this job nicely.
Ether
Thanks Ether. I'm still getting used to the idea that users can edit my answers, though. (I knew that when I signed up, but I always wondered how often it would happen, and why. Well, here's a legitimate reason.)
MvanGeest
I have about 20K SGML files, and I just want to check their dates. Parsing them with, say, SGML::Parser would be overkill and slow, unless I used a SAX-based parser. BTW, your regex indeed worked. Thanks!
Gilbeg
+3  A: 

If the date format is fixed, you might want to use something like this:

m/<DATE(.*)>([0-9]+(st|nd|rd|th)\s(January|February|March|April|May|June|July|August|September|October|November|December)\s[0-9]+)(.*)<\/DATE>/
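
As written, that pattern relies on . reaching the closing tag, which it cannot do across a line break unless the /s modifier is added. The following is only a sketch of the same fixed-format idea, assuming the date directly follows the opening tag; the names $month and find_fixed_date are hypothetical:

```perl
use strict;
use warnings;

# Month names as a reusable, non-capturing sub-pattern.
my $month = qr/(?:January|February|March|April|May|June|July|August|September|October|November|December)/;

# Hypothetical helper built on the fixed-format idea above.
sub find_fixed_date {
    my ($content) = @_;
    # [^>]* skips any attributes; /s lets .*? run across line breaks up to </DATE>.
    return $1 if $content =~ m{<DATE[^>]*>\s*(\d+(?:st|nd|rd|th)\s$month\s\d+).*?</DATE>}s;
    return;
}

my $hairy = qq{<DATE blaAttrib="89787adjd98d9">4th July 1936\n<EM>spanned across multiple lines</EM></DATE>};
print find_fixed_date($hairy), "\n";
```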
dark_charlie
+3  A: 

Instead of matching .*, you should match "everything that is not an angle bracket",

i.e.:

if ($string =~ /<DATE[^>]*>([^<]+)</) {
    # here, $1 is your date
}

benzebuth
Many thanks... you are right, just as Kenny suggested. Thanks!
Gilbeg
+3  A: 

You should use non-greedy matching and the /s modifier so that . also matches newlines:

my @l = (
'<DATE>4th July 1936</DATE>',
'<DATE blaAttrib="89787adjd98d9">4th July 1936</DATE>',
'<DATE blaAttrib="89787adjd98d9">4th July 1936
<EM>spanned across multiple lines and EM element inside DATE</EM></DATE>'
);

foreach (@l) {
  # \s* on both sides keeps the trailing newline out of $1
  /^<DATE.*?>\s*(.*?)\s*</s && print "$1\n";
}

output:

4th July 1936
4th July 1936
4th July 1936
M42
A: 

Even your "hairy" example can be reduced to a similar type. If you are always going to have 1) the actual date on the same line as the start tag--and 2) that's all you want--it doesn't matter where the end tag is.

$fileContent =~ m/<DATE([^>]*)>\s*(\d+\p{Alpha}+\s+\p{Alpha}+\s+\d{4})/

will work as long as both of those conditions hold. (If you're not going to find '>' inside the tag, it's a good idea to use [^>]* rather than .*, which eats up the entire line, makes the expression fail, and then has to give back and check, give back and check, ...)
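
To make the difference concrete, here is a small sketch comparing the two forms on a single line. The sample line and variable names are made up for illustration:

```perl
use strict;
use warnings;

my $line = '<DATE a="1">4th July 1936 <EM>note</EM></DATE>';

# Greedy .* runs to the end of the line and backtracks to the LAST '>' that
# still lets the rest of the pattern match, so it swallows the date.
my ($greedy_attrs, $greedy_body) = $line =~ m/<DATE(.*)>(.*)<\/DATE>/;

# [^>]* cannot run past the first '>', so it stops at the end of the start tag.
my ($class_attrs, $class_body) = $line =~ m/<DATE([^>]*)>(.*)<\/DATE>/;

print "greedy: attrs=[$greedy_attrs] body=[$greedy_body]\n";
print "class:  attrs=[$class_attrs] body=[$class_body]\n";
```

With the greedy form, the first group ends up holding everything up to the end of the EM element and the second group is empty; with the character class, the first group holds only the attributes.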

Axeman