I am a complete Perl newb, but I am certain that learning Perl will be easier than figuring out how to parse XML in awk. I would like to parse the .sgm files from this dataset:

http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html

This is a collection of 20,000 Reuters newswire articles from a decade ago, and is a standard test set for certain types of text processing. To simplify my Perl testing, I grabbed the first few hundred lines from the first file and saved them as test.sgm, to use until my script works correctly on that. It starts out like this:

<!DOCTYPE lewis SYSTEM "lewis.dtd">
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5544" NEWID="1">
<DATE>26-FEB-1987 15:01:01.79</DATE>
<TOPICS><D>cocoa</D></TOPICS>
<PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN> 
&#5;&#5;&#5;C T
&#22;&#22;&#1;f0704&#31;reute
u f BC-BAHIA-COCOA-REVIEW   02-26 0105</UNKNOWN>
<TEXT>&#2;
<TITLE>BAHIA COCOA REVIEW</TITLE>
<DATELINE>    SALVADOR, Feb 26 - </DATELINE><BODY>Showers continued throughout the week in
the Bahia cocoa zone, alleviating the drought since early
January and improving prospects for the coming temporao,...

I used a Perl script from http://www.xml.com/pub/a/2001/05/16/perlxml.html as an example, and ended up with this, extract.pl:

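use strict;
use warnings;
use XML::DOM;

# Path to the .sgm file, taken from the first command-line argument
my $file = $ARGV[0];

# Parse the whole file into a DOM tree (this is where the error is raised)
my $parser = XML::DOM::Parser->new();
my $doc = $parser->parsefile($file);

#print $doc->getElementsByTagName('DATE');

print "\n";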

and I get this output:

> perl extract.pl test.sgm

reference to invalid character number at line 11, column 0, byte 343 at /usr/lib64/perl5/vendor_perl/5.8.5/x86_64-linux-thread-multi/XML/Parser.pm line 187
>

Google doesn't help (the top hit appears to be a page that is experiencing the same error I am) and my Perl hacker friend is still hung over from Blackhat in Vegas. Any ideas what I'm doing wrong, or how I can clean the file? I assume the badness is happening inside that "Unknown" tag, which I don't even need. I really just want to extract the text from every article. If you need more info please let me know.

+4  A: 

The numeric character reference "&#5;" is not legal in a well-formed XML document. See section 4.1, "Character and Entity References", of the XML recommendation:

Characters referred to using character references MUST match the production for Char.

Now if we follow the link and look at the production for Char:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

we see that some characters, including most ASCII control characters such as #x5, can appear neither literally nor as a numeric character reference in a well-formed XML 1.0 document.
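To make the production concrete, here is a minimal shell sketch that tests whether a decimal code point falls inside the Char ranges quoted above. The function name `is_xml_char` is made up for illustration; the range boundaries are just the hexadecimal values from the production converted to decimal.

```shell
# Hypothetical helper: succeed (exit status 0) if the decimal code point $1
# matches the XML 1.0 Char production, fail otherwise.
# Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
is_xml_char() {
  cp=$1
  if [ "$cp" -eq 9 ] || [ "$cp" -eq 10 ] || [ "$cp" -eq 13 ] \
     || { [ "$cp" -ge 32 ] && [ "$cp" -le 55295 ]; } \
     || { [ "$cp" -ge 57344 ] && [ "$cp" -le 65533 ]; } \
     || { [ "$cp" -ge 65536 ] && [ "$cp" -le 1114111 ]; }; then
    return 0
  fi
  return 1
}

# Check the references that appear in the sample file, plus two legal ones.
for cp in 5 22 1 31 2 9 65; do
  if is_xml_char "$cp"; then
    echo "&#$cp; is allowed"
  else
    echo "&#$cp; is not allowed"
  fi
done
```

Every reference in the sample's UNKNOWN and TEXT elements (5, 22, 1, 31, 2) falls outside the ranges, which is consistent with the parser giving up at the first one it meets.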

An oddity that; I've learned something about XML today :).

See this conversation on ASCII control characters in XML for a possible workaround.
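If the control characters carry no information you need, one possible workaround is to strip the offending references before parsing. The following is only a sketch under that assumption: the function name `strip_illegal_refs` is made up, it handles decimal references only (not hexadecimal forms like `&#x5;`), and it targets just the code points below 32 that the Char production excludes (0-8, 11, 12 and 14-31), leaving legal references such as `&#9;` or `&#10;` alone.

```shell
# Hypothetical helper: delete decimal character references to control
# characters that XML 1.0 forbids (0-8, 11, 12, 14-31). Leading zeros are
# tolerated. Reads stdin, writes stdout.
strip_illegal_refs() {
  sed -E 's/&#0*([0-8]|1[124-9]|2[0-9]|3[01]);//g'
}

# Demo on two lines taken from the sample file in the question.
printf '%s\n' '&#5;&#5;&#5;C T' '&#22;&#22;&#1;f0704&#31;reute' | strip_illegal_refs
```

To clean a whole file, something like `strip_illegal_refs <test.sgm >cleantest.sgm` should work. Deleting the references rather than replacing them with a placeholder is a design choice; substitute any replacement text in the `sed` expression if you want to see where they were.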

Inshallah
Well then. It appears I'm doing nothing wrong. Since I'm neither creating the XML nor using the invalid characters for anything useful, a simple sed "s//bad/g" <test.sgm >cleantest.sgm seems to do the trick. Well, it's still complaining about "junk after document element at line 72", but that's unrelated. Thanks for tracking down that XML archive for me.
PlexLuthor