ansaurus

Question

How can Perl's XML::Simple ignore HTML embedded in XML?

Answer 1

+2 A:

If the HTML is included directly in the XML (rather than being escaped or wrapped in a CDATA section), then XML::Simple has no way to know where to stop parsing.

However, you can reconstitute just the HTML by passing that section of the data structure to XML::Simple's XMLout() function.
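A minimal sketch of that idea follows. The document, the element names (`doc`, `content`, `div`) and the `ForceArray`/`KeyAttr` settings are assumptions chosen for illustration, not part of the original question. Note that XML::Simple does not preserve element order or mixed text-and-element content, so the regenerated markup is equivalent at best, not byte-identical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::Simple qw(XMLin XMLout);

# Hypothetical input: HTML markup sits directly inside <content>.
my $xml = <<'XML';
<doc>
  <title>Example</title>
  <content>
    <div class="post">
      <p><a href="http://example.com/"><img src="tada"/></a></p>
    </div>
  </content>
</doc>
XML

# ForceArray => 1 keeps every child element as an array, and KeyAttr => []
# turns off array folding, so the structure can be written back out as
# nested elements rather than being collapsed.
my $data = XMLin( $xml, ForceArray => 1, KeyAttr => [] );

# Pick out the subtree that holds the HTML and serialise only that part.
my $div  = $data->{content}[0]{div}[0];
my $html = XMLout( $div, RootName => 'div', KeyAttr => [] );

print $html;
```

Scalar values in the structure come back as attributes and arrays as child elements, which is why `ForceArray => 1` matters on the way in.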

marnanel 2010-04-14 20:34:33

Answer 2

A:

If the HTML is not inside a CDATA section or otherwise encoded, there is a slight hack you can use.

Before processing with XML::Simple, find the contents of the <my_html> tag, which presumably hold the suspect HTML, and pass them through an HTML entity encoder such as HTML::Entities ("<" => "&lt;", etc.). Then substitute the encoded content for the original content of the <my_html> tag.

This is VERY hacky, VERY easy to do incorrectly unless you know 100% what you're doing with regular expressions, and should not be done.

Having said that, it WILL solve your problem.
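A rough sketch of the pre-encoding step, under stated assumptions: the `<my_html>` tag carries no attributes and never nests inside itself, and the sample document and the helper name `escape_html_section` are hypothetical. The non-greedy regular expression is exactly the fragile part the warning above refers to.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::Entities qw(encode_entities);
use XML::Simple qw(XMLin);

# Hypothetical input: <my_html> holds raw, unescaped markup.
my $xml = <<'XML';
<doc>
  <title>Example</title>
  <my_html><p>Hello <b>world</b></p></my_html>
</doc>
XML

# Entity-encode everything between <my_html> and </my_html>.
# Only '<', '>' and '&' are encoded, so the parser sees plain text there.
sub escape_html_section {
    my ($text) = @_;
    $text =~ s{<my_html>(.*?)</my_html>}
              { my $inner = $1;
                '<my_html>' . encode_entities( $inner, '<>&' ) . '</my_html>' }gse;
    return $text;
}

my $data = XMLin( escape_html_section($xml) );

# The parser decodes the entities again, so this is the original markup.
print $data->{my_html}, "\n";
```

Because `&` is encoded as well, entities already present in the HTML survive the round trip unchanged.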

DVK 2010-04-14 20:38:22

Answer 3

+3 A:

#!/usr/bin/perl

use strict; use warnings;

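use XML::LibXML::Reader;

# Read the XML from the __DATA__ section below.
my $reader = XML::LibXML::Reader->new(IO => \*DATA)
    or die "Cannot read XML\n";

# Advance to the first <content> element and print its inner markup verbatim.
if ( $reader->nextElement('content') ) {
    print $reader->readInnerXml;
}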

__DATA__
<content>
<div xmlns="http://www.w3.org/1999/xhtml">
<p><a href="http://miamiherald.typepad.com/" style="float:left"><img
src="tada"/></a></p>
</div>
</content>

Output:

<div xmlns="http://www.w3.org/1999/xhtml">
<p><a href="http://miamiherald.typepad.com/" style="float:left"><img src="tada"/></a></p>
</div>

Sinan Ünür 2010-04-15 10:29:28

Answer 4

+3 A:

My general rule is that when XML::Simple starts to fail, it's time to move on to another XML processing module. XML::Simple is really supposed to be for situations that you don't need to think about. Once you have a weird case that you have to think about, you're going to have to do some extra work that I usually find quite kludgey to integrate with XML::Simple.

brian d foy 2010-04-16 04:19:00
