As per the HTML Purifier smoketest, 'malformed' URIs are occasionally discarded to leave behind an attribute-less anchor tag, e.g.

<a href="javascript:document.location='http://www.google.com/'">XSS</a> becomes <a>XSS</a>

...as well as occasionally being stripped down to the protocol, e.g.

<a href="http://1113982867/">XSS</a> becomes <a href="http:/">XSS</a>

While that's unproblematic, per se, it's a bit ugly. Instead of trying to strip these out with regular expressions, I was hoping to use HTML Purifier's own library capabilities / injectors / plug-ins / whathaveyou.

Point of reference: Handling attributes

Conditionally removing an attribute in HTMLPurifier is easy. Here the library offers the class HTMLPurifier_AttrTransform with the method confiscateAttr().

While I don't personally use the functionality of confiscateAttr(), I do use an HTMLPurifier_AttrTransform as per this thread to add target="_blank" to all anchors.

// more configuration stuff up here
$htmlDef = $htmlPurifierConfiguration->getHTMLDefinition(true);
$anchor  = $htmlDef->addBlankElement('a');
$anchor->attr_transform_post[] = new HTMLPurifier_AttrTransform_Target();
// purify down here

HTMLPurifier_AttrTransform_Target is a very simple class, of course.

class HTMLPurifier_AttrTransform_Target extends HTMLPurifier_AttrTransform
{
    public function transform($attr, $config, $context) {
        // I could call $this->confiscateAttr() here to throw away an
        // undesired attribute
        $attr['target'] = '_blank';
        return $attr;
    }
}

That part works like a charm, naturally.

Handling elements

Perhaps I'm not squinting hard enough at HTMLPurifier_TagTransform, or am looking in the wrong place(s), or generally amn't understanding it, but I can't seem to figure out a way to conditionally remove elements.

Say, something to the effect of:

// more configuration stuff up here
$htmlDef = $htmlPurifierConfiguration->getHTMLDefinition(true);
$anchor  = $htmlDef->addElementHandler('a');
$anchor->elem_transform_post[] = new HTMLPurifier_ElementTransform_Cull();
// add target as per 'point of reference' here
// purify down here

The Cull class would extend something that has a confiscateElement() ability, or comparable, in which I could check for a missing href attribute or an href attribute with the content http:/.

HTMLPurifier_Filter

I understand I could create a filter, but the examples (Youtube.php and ExtractStyleBlocks.php) suggest I'd be using regular expressions in that, which I'd really rather avoid, if it is at all possible. I'm hoping for an onboard or quasi-onboard solution that makes use of HTML Purifier's excellent parsing capabilities.

Returning null in a child-class of HTMLPurifier_AttrTransform unfortunately doesn't cut it.

Anyone have any smart ideas, or am I stuck with regexes? :)

+2  A: 

The fact that you can't remove elements with a TagTransform appears to have been an implementation detail. The classic mechanism for removing nodes (a smidge higher-level than just tags) is to use an Injector though.

Anyway, the particular piece of functionality you're looking for is already implemented as %AutoFormat.RemoveEmpty
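A minimal sketch of switching that directive on, assuming HTML Purifier is installed and `HTMLPurifier.auto.php` is on the include path (the path, the sample inputs and the variable names are illustrative assumptions, not taken from the question):

```php
<?php
// Assumption: HTML Purifier's autoloader is reachable at this path.
require_once 'HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
// Disable the definition cache so the sketch needs no writable cache directory.
$config->set('Cache.DefinitionImpl', null);
// Ask HTML Purifier to drop elements that end up with no content.
$config->set('AutoFormat.RemoveEmpty', true);

$purifier = new HTMLPurifier($config);

// An anchor with a rejected href and no content is a candidate for removal...
$withoutText = $purifier->purify('<p>before <a href="javascript:alert(1)"></a> after</p>');
// ...whereas an anchor that still wraps text is not considered empty.
$withText = $purifier->purify('<p>before <a href="javascript:alert(1)">XSS</a> after</p>');

echo $withoutText, "\n", $withText, "\n";
```

Note that this only covers anchors with no content at all; an anchor that still wraps a text node does not count as empty.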

Edward Z. Yang
Arrrgh, so close! I changed my `HTMLPurifier_AttrTransform_Target` class around so it wouldn't add `target="_blank"` to the cases I want to strip out (and for now, to test, in the same class, strip out `href="http:/"` if I come across it, will put that in its own class later), but `AutoFormat.RemoveEmpty` still doesn't fire because there's a text node in the anchor. If there's no text in it, it's gold, it works, so, argh, so close! Thank you so much, though, it was definitely something I hadn't thought of. [I'll take a look at injectors in a moment!]
pinkgothic
Injectors loaded via `AutoFormat.Custom` seem to be called pre-purification, or at least pre-URI-purification - I'm not getting the empty tags there yet. Is there a way I can delay the call of the injector post-URI-purification?
pinkgothic
What some other filters have done is force attribute validation beforehand, and then armor the resulting token with `$token->armor['ValidateAttributes'] = true`
Edward Z. Yang
How would I force attribute validation beforehand? I mean, the instances of `<a></a>` and `<a href="http:/"></a>` are created by the core as far as I'm aware. Where would I tell it to do that before the injector?
pinkgothic
Basically, you get an instance of HTMLPurifier_AttrValidator and then run $attr_validator->validateToken($token, $config, $context);.
Edward Z. Yang
A: 

For perusal, this is my current solution. It works, but bypasses HTML Purifier entirely.

/**
 * Removes <a></a> and <a href="http:/"></a> tags from the purified
 * HTML.
 * @todo solve this with an injector?
 * @param string $purified The purified HTML
 * @return string The purified HTML, sans pointless anchors.
 */
private function anchorCull($purified)
{
    if (empty($purified)) return '';
    // re-parse HTML
    $domTree = new DOMDocument();
    $domTree->loadHTML($purified);
    // find all anchors (even good ones)
    $anchors = $domTree->getElementsByTagName('a');
    // collect bad anchors (destroying them in this loop breaks the DOM)
    $destroyNodes = array();
    for ($i = 0; ($i < $anchors->length); $i++) {
        $anchor = $anchors->item($i);
        $href   = $anchor->attributes->getNamedItem('href');
        // <a></a>
        if (is_null($href)) {
            $destroyNodes[] = $anchor;
        // <a href="http:/"></a>
        } else if ($href->nodeValue == 'http:/') {
            $destroyNodes[] = $anchor;
        }
    }
    // destroy the collected nodes
    foreach ($destroyNodes as $node) {
        // preserve content; childNodes is a live list, so keep moving the
        // first child out until none remain instead of indexing into it
        while ($node->firstChild) {
            $node->parentNode->insertBefore($node->firstChild, $node);
        }
        // actually destroy the node
        $node->parentNode->removeChild($node);
    }
    // strip out HTML out of DOM structure string
    $html = $domTree->saveHTML();
    $begin = strpos($html, '<body>') + strlen('<body>');
    $end   = strpos($html, '</body>');
    return substr($html, $begin, $end - $begin);
}

I'd still much rather have a good HTML Purifier solution to this, so, just as a heads-up, this answer won't end up self-accepted. But in case no better answer ends up coming around, at least it might help those with similar issues. :)

pinkgothic
+2  A: 

Success! Thanks to Ambush Commander and mcgrailm in another question, I am now using a hilariously simple solution:

// a bit of context
$htmlDef = $this->configuration->getHTMLDefinition(true);
$anchor  = $htmlDef->addBlankElement('a');

// HTMLPurifier_AttrTransform_RemoveLoneHttp strips 'href="http:/"' from
// all anchor tags (see first post for class detail)
$anchor->attr_transform_post[] = new HTMLPurifier_AttrTransform_RemoveLoneHttp();

// this is the magic! We're making 'href' a required attribute (note the
// asterisk) - now HTML Purifier removes <a></a>, as well as
// <a href="http:/"></a> after HTMLPurifier_AttrTransform_RemoveLoneHttp
// is through with it!
$htmlDef->addAttribute('a', 'href*', new HTMLPurifier_AttrDef_URI());
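For anyone who lands here without the class body handy, this is a rough sketch of what a transform like HTMLPurifier_AttrTransform_RemoveLoneHttp could look like. It is an illustration only: the include path and the sample attribute arrays are assumptions, and it calls transform() directly instead of going through a full purify run.

```php
<?php
// Assumption: HTML Purifier's autoloader is reachable at this path.
require_once 'HTMLPurifier.auto.php';

/**
 * Sketch: throws away an href that has been stripped down to the bare
 * protocol ('http:/'), so a required-href rule can then remove the anchor.
 */
class HTMLPurifier_AttrTransform_RemoveLoneHttp extends HTMLPurifier_AttrTransform
{
    public function transform($attr, $config, $context) {
        if (isset($attr['href']) && $attr['href'] === 'http:/') {
            // confiscateAttr() unsets the key and returns its old value
            $this->confiscateAttr($attr, 'href');
        }
        return $attr;
    }
}

// Exercise the transform on its own with a default config and context.
$transform = new HTMLPurifier_AttrTransform_RemoveLoneHttp();
$config    = HTMLPurifier_Config::createDefault();
$context   = new HTMLPurifier_Context();

// The lone protocol is dropped, other attributes are left alone.
$stripped = $transform->transform(array('href' => 'http:/', 'title' => 'x'), $config, $context);
// A normal href passes through untouched.
$kept = $transform->transform(array('href' => 'http://example.com/'), $config, $context);
```

Doing the comparison in a post-transform means it sees the href after URI validation has already reduced it, which is the state the required-attribute check then works from.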

It works, it works, bahahahaHAHAHAHAnhͥͤͫ̀ğͮ͑̆ͦó̓̉ͬ͋h́ͧ̆̈́̉ğ̈́͐̈a̾̈́̑ͨô̔̄̑̇g̀̄h̘̝͊̐ͩͥ̋ͤ͛g̦̣̙̙̒̀ͥ̐̔ͅo̤̣hg͓̈́͋̇̓́̆a͖̩̯̥͕͂̈̐ͮ̒o̶ͬ̽̀̍ͮ̾ͮ͢҉̩͉̘͓̙̦̩̹͍̹̠̕g̵̡͔̙͉̱̠̙̩͚͑ͥ̎̓͛̋͗̍̽͋͑̈́̚...! * manic laughter, gurgling noises, keels over with a smile on her face *

pinkgothic