Possible Duplicate:
RegEx match open tags except XHTML self-contained tags

I need to do some regex replacement on HTML input, but some parts of the input, matched by a second regex, must be excluded from the replacement.

(e.g. remove all <a> tags with specific href="example.com…, except the ones that are inside the <form> tag)

Is there any smart regex technique for this? Or do I have to find all forms using $regex1, split the input into smaller chunks that exclude the matched text blocks, and then run $regex2 on each chunk?

+1  A: 

Why can't you just load the HTML string into a DOM helper, use getElementsByTagName('a') to grab all anchors, getAttribute to read each href, and removeChild to remove the ones you don't want?
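
A minimal sketch of that idea in PHP, assuming the input can be parsed by DOMDocument. The sample markup and the href value "foo" are placeholders, not taken from the question. Note that getElementsByTagName returns a live list, so it is copied into an array before any node is removed.

```php
<?php
// Hypothetical sample input; "foo" is a placeholder for the href to strip.
$html = '<div><a href="foo">first</a> <form><a href="foo">second</a></form> <a href="bar">third</a></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// The node list is live: copy it first, otherwise removals skip elements.
$anchors = iterator_to_array($doc->getElementsByTagName('a'));

foreach ($anchors as $a) {
    if ($a->getAttribute('href') !== 'foo') {
        continue; // not one of the links we want to remove
    }
    // Walk up the tree to see whether this link sits inside a <form>.
    $insideForm = false;
    for ($p = $a->parentNode; $p !== null; $p = $p->parentNode) {
        if ($p->nodeName === 'form') {
            $insideForm = true;
            break;
        }
    }
    if (!$insideForm) {
        $a->parentNode->removeChild($a);
    }
}

$result = $doc->saveHTML();
```

Here only the first link is removed: the second is protected by its form ancestor and the third has a different href.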

meder
The OP probably has a big HTML file coming in and the whole question is finding <a> inside of <form> tags.
Dmitri Farkov
A: 

Any particular reason you would want to do that with regular expressions? It sounds like it would be fairly straightforward in JavaScript to spin through the DOM and do it that way.

In jQuery, for instance, it seems like you could do this in just a couple lines using its DOM selectors.

Jason Kester
What about a fallback option in case the user does not have JS enabled?
Dmitri Farkov
A: 
  • If forms can be nested, it is technically impossible.
  • If forms cannot be nested, it is practically impossible. There is no function that lets a single regex both
    1. define the area where matching should be done (i.e. outside a form), and
    2. define the things to be matched (i.e. the <a> elements).
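
The two-step workaround described in the question (one regex to cut out the forms, a second to filter the rest) could be sketched roughly like this. It assumes forms are never nested and that the markup is regular enough for regexes at all; the sample input and the href value "foo" are placeholders.

```php
<?php
// Hypothetical sample input; "foo" stands in for the href to strip.
$html = '<a href="foo">first</a> <form><a href="foo">second</a></form> <a href="foo">third</a>';

// Step 1: split on <form>...</form>, keeping each form as a chunk of its own.
// Assumes forms are not nested; a nested form would end the match too early.
$chunks = preg_split('~(<form\b.*?</form>)~is', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

// Step 2: with one capture group, odd indexes hold the forms,
// so only the even-indexed chunks are filtered.
foreach ($chunks as $i => $chunk) {
    if ($i % 2 === 0) {
        $chunks[$i] = preg_replace('~<a\b[^>]*\bhref="foo"[^>]*>.*?</a>~is', '', $chunk);
    }
}

$result = implode('', $chunks);
```

This keeps the link inside the form and drops the two outside it, but it breaks as soon as the HTML stops being this tidy, which is why a DOM parser is the safer route.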
Sjoerd
+1  A: 

This looks like PHP, right? http://htmlpurifier.org/

MiffTheFox
+2  A: 

The NON-regexp way:

<?php
$html = '<html><body><a href="foo">a <b>bold</b> foz </a> b c <form><a href="foo">l</a></form> <a href="boz">a</a></body></html>';
$d = new DOMDocument();
$d->loadHTML($html);
$x = new DOMXPath($d);
// Select every <a href="foo"> that has no <form> ancestor.
$elements = $x->query('//a[not(ancestor::form) and @href="foo"]');
foreach ($elements as $elm) {
    // Optional: run this loop if the contents of the <a> should stay visible.
    // It moves each child node out in front of the <a> before the <a> is removed.
    while ($elm->firstChild) {
        $elm->parentNode->insertBefore($elm->firstChild, $elm);
    }
    // Remove the <a> itself.
    $elm->parentNode->removeChild($elm);
}
var_dump($d->saveXML());
?>
Wrikken