Premise

I'd like to use HTML Purifier to transform <body> tags into <div> tags so that inline styling on the <body> element is preserved, e.g. <body style="background-color:#000000;">Hi there.</body> would turn into <div style="background-color:#000000;">Hi there.</div>. I'm looking at a combination of a custom tag and a TagTransform class.

Current setup

In my configuration section, I'm currently doing this:

$htmlDef  = $this->configuration->getHTMLDefinition(true);
// defining the element to avoid triggering 'Element 'body' is not supported'
$bodyElem = $htmlDef->addElement('body', 'Block', 'Flow', 'Core');
$bodyElem->excludes = array('body' => true);
// add the transformation rule
$htmlDef->info_tag_transform['body'] = new HTMLPurifier_TagTransform_Simple('div');

...as well as allowing <body> and its style (and class, and id) attributes via the configuration directives (they're part of a large, working list that's parsed into HTML.AllowedElements and HTML.AllowedAttributes).

I've turned definition caching off.

$config->set('Cache.DefinitionImpl', null);

Unfortunately, in this setup, it seems like HTMLPurifier_TagTransform_Simple never has its transform() method called.

HTML.Parent?

I presume the culprit is my HTML.Parent, which is set to 'div'; quite naturally, <div> does not allow a child <body> element. However, setting HTML.Parent to 'html' nets me:

ErrorException: Cannot use unrecognized element as parent

Adding...

$htmlElem = $htmlDef->addElement('html', 'Block', 'Flow', 'Core');
$htmlElem->excludes = array('html' => true);

...gets rid of that error message but still doesn't transform the tag - it's removed instead.

Adding...

$htmlElem = $htmlDef->addElement('html', 'Block', 'Custom: head?, body', 'Core');
$htmlElem->excludes = array('html' => true);

...also does nothing, because it nets me an error message:

ErrorException: Trying to get property of non-object       

[...]/library/HTMLPurifier/Strategy/FixNesting.php:237
[...]/library/HTMLPurifier/Strategy/Composite.php:18
[...]/library/HTMLPurifier.php:181
[...]

I'm still tweaking around with the last option now, trying to figure out the exact syntax I need to provide, but if someone knows how to help me based on their own past experience, I'd appreciate any pointers in the right direction.

HTML.TidyLevel?

The only other culprit I can imagine is my HTML.TidyLevel, which is set to 'heavy'. I've yet to try all possible combinations of this, but so far it makes no difference.

(Since I've only been working on this on the side, I struggle to recall which combinations I've already tried; otherwise I would list them here. As it is, I'm not confident I wouldn't miss something I've done or misreport something. I might edit this section later when I've done some dedicated testing, though!)

Full Configuration

My configuration data is stored in JSON and then parsed into HTML Purifier. Here's the file:

{
    "CSS" : {
        "MaxImgLength" : "800px"
    },
    "Core" : {
        "CollectErrors" : true,
        "HiddenElements" : {
            "script"   : true,
            "style"    : true,
            "iframe"   : true,
            "noframes" : true
        },
        "RemoveInvalidImg" : false
    },
    "Filter" : {
        "ExtractStyleBlocks" : true
    },
    "HTML" : {
        "MaxImgLength" : 800,
        "TidyLevel"    : "heavy",
        "Doctype"      : "XHTML 1.0 Transitional",
        "Parent"       : "html"
    },
    "Output" : {
        "TidyFormat"   : true
    },
    "Test" : {
        "ForceNoIconv" : true
    },
    "URI" : {
        "AllowedSchemes" : {
            "http"     : true,
            "https"    : true,
            "mailto"   : true,
            "ftp"      : true
        },
        "DisableExternalResources" : true
    }
}

(URI.Base, URI.Munge and Cache.SerializerPath are also set, but I've removed them in this paste. Also, HTML.Parent caveat: As mentioned, usually, this is set to 'div'.)
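
The loader that feeds this file into HTML Purifier isn't shown. As a rough sketch of how the two-level structure could map onto "Namespace.Directive" names, assuming a hypothetical helper called flattenDirectives() and a shortened copy of the JSON above:

```php
<?php
// Hypothetical helper (not part of HTML Purifier): turns the two-level
// JSON structure into "Namespace.Directive" => value pairs.
function flattenDirectives(array $settings)
{
    $flat = array();
    foreach ($settings as $namespace => $directives) {
        foreach ($directives as $name => $value) {
            // Nested values such as HiddenElements stay intact as arrays.
            $flat[$namespace . '.' . $name] = $value;
        }
    }
    return $flat;
}

// Shortened excerpt of the configuration file, for illustration only.
$json = '{
    "Core" : { "CollectErrors" : true, "HiddenElements" : { "script" : true, "style" : true } },
    "HTML" : { "TidyLevel" : "heavy", "Parent" : "div" }
}';

// Passing true as the second argument decodes objects to associative arrays.
$directives = flattenDirectives(json_decode($json, true));

foreach ($directives as $key => $value) {
    // With HTML Purifier loaded, this is where $config->set($key, $value) would go.
    echo $key, "\n";
}
```

Only the first level is flattened, so lookup-style directives such as Core.HiddenElements are handed over as arrays rather than being split into further keys.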

+1  A: 

Wouldn't it be much easier to do:

$search = array('<body', 'body>');
$replace = array('<div', 'div>');

$html = '<body style="background-color:#000000;">Hi there.</body>';

echo str_replace($search, $replace, $html);

// Output: <div style="background-color:#000000;">Hi there.</div>
Ben
On the final output of HTML Purifier, when I know nothing malicious has survived the process, that's probably indeed an option. However, before I end up overlooking something with a simple string-replace, I'd rather know I can rely on the solution; HTML Purifier parses and tokenises HTML reliably and given that I'm fairly sure whatever I'm overlooking is a small issue, I'd *definitely* rather have that solution. But, still, thank you. :)
pinkgothic
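
To make the concern in the comment above concrete, here is a small sketch of a plain string replacement catching more than the tag itself. The input string is a made-up example; it relies only on the fact that an unescaped '>' is legal inside HTML text:

```php
<?php
$search  = array('<body', 'body>');
$replace = array('<div', 'div>');

// Hypothetical input: the word "anybody" is followed by a literal '>'.
$html = '<body><p>Does anybody> know?</p></body>';

// str_replace() works on raw characters, so it also rewrites the text node.
echo str_replace($search, $replace, $html), "\n";
// <div><p>Does anydiv> know?</p></div>
```

A tokenising approach would not have this problem, because it only rewrites tokens that are actually tags.
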
+2  A: 

This code is the reason why what you're doing doesn't work:

/**
 * Takes a string of HTML (fragment or document) and returns the content
 * @todo Consider making protected
 */
public function extractBody($html) {
    $matches = array();
    $result = preg_match('!<body[^>]*>(.*)</body>!is', $html, $matches);
    if ($result) {
        return $matches[1];
    } else {
        return $html;
    }
}

You can turn it off by setting %Core.ConvertDocumentToFragment to false; if the rest of your code is bug-free, it should work straight from there. I don't believe your bodyElem definition is necessary.
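
To see the effect in isolation, the same regex can be run outside the library. This is only a sketch: extractBodySketch() is a hypothetical standalone copy of the method quoted above, and the two input strings are made up for illustration:

```php
<?php
// Hypothetical standalone copy of the extractBody() logic quoted above,
// so the effect can be observed without loading HTML Purifier.
function extractBodySketch($html)
{
    $matches = array();
    // Capture everything between the opening and closing body tags.
    if (preg_match('!<body[^>]*>(.*)</body>!is', $html, $matches)) {
        return $matches[1];
    }
    // No <body> wrapper: the input is returned unchanged.
    return $html;
}

$document = '<body style="background-color:#000000;">Hi there.</body>';
$fragment = '<p>Hi there.</p>';

// The <body> tag and its style attribute are already gone at this point,
// so a tag transform registered for 'body' has nothing left to act on.
echo extractBodySketch($document), "\n"; // Hi there.
echo extractBodySketch($fragment), "\n"; // <p>Hi there.</p>
```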

Edward Z. Yang
*Ambush Commander to the rescue!* Thank you - awesome, this works! :D For completeness' sake (if someone else stumbles across this): The `$bodyElem` definition still seems to be necessary. I was also a bit concerned because `<title>blah</title>` was turning up as `blah` in the final fragment, but then remembered that I can just add `'head'` to the `Core.HiddenElements` list. Now it works like a charm!
pinkgothic
And another quick add-on for completeness' sake: `<body>` and its `style` attribute do not need to be in the tag whitelist, just the tag it's transformed into (and that tag's attribute).
pinkgothic
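
For anyone skimming, the pieces from the question, the accepted answer and the comments can be collected into one sketch. It is not a tested, definitive setup. The require path is an assumption about the install, HTML.Parent is set to 'div' because that is the asker's usual value, and the whitelist directives (HTML.AllowedElements, HTML.AllowedAttributes) are left out:

```php
<?php
// Assumption: adjust this path to wherever HTML Purifier lives in your project.
require_once 'HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();

// Definition caching is off, as in the question.
$config->set('Cache.DefinitionImpl', null);

// From the answer: keep the <body> tag instead of extracting its content.
$config->set('Core.ConvertDocumentToFragment', false);

// From the comments: hide 'head' so <title> text does not leak into the output.
$config->set('Core.HiddenElements', array(
    'script'   => true,
    'style'    => true,
    'iframe'   => true,
    'noframes' => true,
    'head'     => true,
));

// Assumption: the asker's usual parent element.
$config->set('HTML.Parent', 'div');

$htmlDef = $config->getHTMLDefinition(true);

// The answer doubts this definition is needed; the asker reports that it is.
$bodyElem = $htmlDef->addElement('body', 'Block', 'Flow', 'Core');
$bodyElem->excludes = array('body' => true);

// The transformation rule from the question.
$htmlDef->info_tag_transform['body'] = new HTMLPurifier_TagTransform_Simple('div');

$purifier = new HTMLPurifier($config);
echo $purifier->purify('<body style="background-color:#000000;">Hi there.</body>');
```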