I'm trying to parse an XML file that I get from an external source, but I'm having problems because the text nodes contain unescaped special characters such as & and <.

Essentially, I'm asking the same question as this, but for Perl instead of PHP.

<report>  
  <company>A & W</company>  
  <company>Some Other Company with a < in Inc.</company>
</report>  

I tried using something like this:

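use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common qw(POST);
use XML::Parser;

my $readAllRecordsURI = "http://mycompany.com/CompanyOnline/GetRecord";
my @form_array = ("action" => "readAll", "table" => "QOPIDINF");

my $ua = LWP::UserAgent->new;

# POST the form and keep the response body, which holds the XML text
my $cics_request = POST $readAllRecordsURI, \@form_array;
my $cics_response = $ua->request($cics_request);
my $xmlfile = $cics_response->content;

my $parser = XML::Parser->new( Handlers => {Char  => \&handle_char});
# $xmlfile holds the document itself, not a file name, so use parse()
$parser->parse( $xmlfile );

# Char handler: receives the Expat object and a chunk of character data
sub handle_char {
   my ($p, $string) = @_;

   #clean up text here...
}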
A: 

XML::Parser / Expat has always worked well for me, including with poorly formed XML.

Do NOT parse XML with a regex... unless your parser does not work >;-} ... Can you just delete the company name with a < in it before parsing?

Here are some regexes to try: XML Shallow Parsing with regex. At the bottom of that page I think there is a regex that matches only correct XML tags; could you invert it to find the poorly formed ones?
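
As a rough illustration of the pre-cleaning idea, here is a minimal sketch that escapes stray & and < characters before the text reaches the parser. The helper name escape_stray_chars is hypothetical, and the heuristic is an assumption about the input: it treats any < followed by a letter, /, ! or ? as real markup, so text such as "a <b" would still slip through.

```perl
use strict;
use warnings;

# Hypothetical helper: escape characters that are clearly not markup.
sub escape_stray_chars {
    my ($xml) = @_;

    # An ampersand that does not begin an entity or character reference
    # (e.g. &amp; &#38; &#x26;) is assumed to be literal text.
    $xml =~ s/&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/&amp;/g;

    # A '<' not followed by something that can start a tag, closing tag,
    # comment, CDATA section, or processing instruction is assumed literal.
    $xml =~ s/<(?![A-Za-z_:\/!?])/&lt;/g;

    return $xml;
}

my $bad = '<report><company>A & W</company>'
        . '<company>Some Other Company with a < in Inc.</company></report>';

print escape_stray_chars($bad), "\n";
```

The cleaned string could then be handed to the parser's parse() method. This only papers over the two cases shown in the question; it is not a general repair for malformed XML.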

drewk
I'm pretty sure my problem is with the handler I'm using.
Mrouge
A: 

Take a look at XML::Liberal. It appears to do just what you want. A very simple example (from one of the unit tests):

my $clean_xml = XML::Liberal->new('LibXML')->parse_string($bad_xml)->toString();
Brian Phillips
This module looks good. Unfortunately, it still gave me an error; it didn't like some of the text. I'll definitely keep this module in mind for other problems like this.
Mrouge
+1  A: 

This isn't really an answer, but it solves my problem: I went back to the programmer who provides the XML and asked him to have it encode the text properly, which avoids all of this.
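
For anyone wondering what "encode the text properly" means on the producing side, here is a minimal sketch in Perl. It is only an illustration: the real source is not written in Perl, and the helper name xml_escape_text is hypothetical. The idea is to escape the text content before wrapping it in tags.

```perl
use strict;
use warnings;

# Hypothetical helper: escape text before placing it inside an element.
sub xml_escape_text {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;   # must run first so later entities are not re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

for my $name ('A & W', 'Some Other Company with a < in Inc.') {
    print '<company>', xml_escape_text($name), "</company>\n";
}
```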

Mrouge
I really think this is the *best* solution anyway.
Brian Phillips