For the life of me I cannot understand the XML::Twig documentation for entity handling.

I've got some XML I'm generating with HTML::Tidy. The call is as follows:

my $tidy = HTML::Tidy->new({
    'indent'          => 1,
    'break-before-br' => 1,
    'output-xhtml'    => 0,
    'output-xml'      => 1,
    'char-encoding'   => 'raw',
});

$str = "foo   bar";
$xml = $tidy->clean("<xml>$str</xml>");

which produces:

<html>
  <head>
    <meta content="tidyp for Linux (v1.02), see www.w3.org" name="generator" />
    <title></title>
  </head>
  <body>foo &nbsp; bar</body>
</html>

XML::Twig (understandably) barfs at the &nbsp;. I want to do some transformations, running it through XML::Twig:

my $twig = XML::Twig->new(
  twig_handlers => {... handlers ...}
);

$twig->parse($xml);

The $twig->parse line barfs on the &nbsp;, but I can't figure out how to declare the &nbsp; entity programmatically. I tried things like:

my $entity = XML::Twig::Entity->new("nbsp", "&#160;");
$twig->entity_list->add($entity);
$twig->parse($xml);

... but no joy.

Please help =)

A: 

There may be a better way, but the following worked for me:

my $filter = sub {
    my $text        = shift;
    my $nbsp_char   = "\x{a0}";    # U+00A0, the non-breaking space character
    my $nbsp_entity = '&nbsp;';
    $text =~ s/$nbsp_char/$nbsp_entity/g;    # /g replaces every occurrence, not just the first
    return $text;
};

XML::Twig->new( output_filter => $filter )
         ->parse_html( $xml )
         ->print;

/I3az/

draegtun
I'm a little hesitant to do regexp parsing on XML if there's a programmatic approach. I'll keep this in reserve if I can't find a way that's more "correct" to accomplish it (thanks).
Sir Robert
+1  A: 
use strict;
use XML::Twig;

my $doctype = '<?xml version="1.0" encoding="utf-8"?><!DOCTYPE html [<!ENTITY nbsp "&#160;">]>';
my $xml = '<html><head><meta content="tidyp for Linux (v1.02), see www.w3.org" name="generator" /><title></title></head><body>foo &nbsp; bar</body></html>';

my $xTwig = XML::Twig->new();

$xTwig->safe_parse($doctype . $xml) or die "Failure to parse XML : $@";

print $xTwig->sprint();
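
If more than one entity turns up in the incoming data, the DOCTYPE string could be built from a list instead of being typed out by hand. A minimal sketch, assuming a hypothetical helper named doctype_with_entities (not part of XML::Twig) and a made-up second entity, &copy;, purely for illustration:

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper (not an XML::Twig function): returns a DOCTYPE whose
# internal subset declares each name => value pair as a general entity.
# Assumes the values contain no double quotes.
sub doctype_with_entities {
    my ($root, %entities) = @_;
    my $decls = join '',
        map { qq{<!ENTITY $_ "$entities{$_}">} } sort keys %entities;
    return "<!DOCTYPE $root [$decls]>";
}

my $doctype = doctype_with_entities('html', nbsp => '&#160;', copy => '&#169;');
my $xml     = '<html><body>foo &nbsp; bar &copy; baz</body></html>';

my $twig = XML::Twig->new;
$twig->safe_parse($doctype . $xml) or die "Failure to parse XML: $@";

print $twig->sprint, "\n";
```

This is still string concatenation in front of the document, so it has the same limits as the code above; it only keeps the entity list in one place.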
bob.faist
This is pretty close to what I'd (ideally) like. It does still rely on text manipulation (in my code) rather than utilizing the XML::Twig API with literal data, but I could simply use this to declare a standard DTD for my incoming data. Upvoted for utility, and it may be the accepted answer (after I fiddle with it).
Sir Robert
I finally accepted this as the best answer. I ended up doing something slightly different, but still text manipulation. This is probably the best general solution.
Sir Robert
+2  A: 

A dirty, but efficient, trick in a case like this would be to add a fake DTD declaration.

Then XML::Parser, which does the parsing, will assume that the entity is defined in the DTD and won't barf on it.

To get rid of the fake DTD declaration, you can output the root of the twig rather than the whole twig, since printing the whole twig would include the declaration. If you need a different declaration, create it and replace the current one:

#!/usr/bin/perl 

use strict;
use warnings;

use XML::Twig;

my $fake_dtd= '<!DOCTYPE head SYSTEM "foo"[]>'; # foo may not even exist

my $xml='<html>
  <head>
    <meta content="tidyp for Linux (v1.02), see www.w3.org" name="generator" />
    <title></title>
  </head>
  <body>foo &nbsp; bar</body>
</html>';

XML::Twig->new->parse( $fake_dtd . $xml)->root->print;
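
To see the effect of the fake declaration in isolation, the same fragment can be parsed with and without it. A minimal sketch, assuming a hypothetical helper parses_ok (not an XML::Twig function) built on safe_parse, which returns false instead of dying when the parse fails:

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: 1 if XML::Twig can parse the string, 0 otherwise.
sub parses_ok {
    my ($xml) = @_;
    return XML::Twig->new->safe_parse($xml) ? 1 : 0;
}

my $body     = '<body>foo &nbsp; bar</body>';
my $fake_dtd = '<!DOCTYPE body SYSTEM "foo"[]>';    # "foo" need not exist

# Without a DOCTYPE the undeclared &nbsp; is a well-formedness error;
# with one, the parser assumes the entity may be declared in the external DTD.
printf "without DOCTYPE:   %s\n", parses_ok($body)             ? 'parsed' : 'failed';
printf "with fake DOCTYPE: %s\n", parses_ok($fake_dtd . $body) ? 'parsed' : 'failed';
```

The first parse should fail and the second should go through, which matches the behaviour described above.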
mirod
@mirod: Hmm.. thanks. That's actually quite helpful!
Sir Robert