ansaurus

Question

Strip anchors down to their contents, only if the anchor's URL contains...

Answer 1

+2 A:

Firstly, this isn't a regex problem (or at least it shouldn't be). PHP comes with an HTML parser so I'd strongly recommend using that.

With that, you just need to loop through all the anchor tags, check each href attribute, modify the node if necessary, and then save the document back to HTML. For example:

$dom = new DOMDocument;
$dom->loadHTML($html); // $html as a string
$anchors = $dom->getElementsByTagName('a');
// Iterate backwards: the node list is live, so removing a node shifts later indexes.
for ($i = $anchors->length - 1; $i >= 0; $i--) {
  $item = $anchors->item($i);
  $href = $item->getAttribute('href');
  $host = parse_url($href, PHP_URL_HOST); // null when the URL has no host
  if (is_string($host) && stripos($host, 'yahoo') !== false) {
    $item->parentNode->removeChild($item);
  }
}
$html = $dom->saveHTML();

Using parse_url() here is optional. You could simply check if the attribute value had "yahoo" anywhere in it without pulling out just the host name.
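To illustrate the difference between the two checks, here is a minimal sketch. The helper names hostContains() and hrefContains() are made up for this example and aren't part of the code above; the URLs are just sample values.

```php
<?php
// Hypothetical helpers for illustration only.

// Match only when the needle appears in the URL's host name.
function hostContains(string $href, string $needle): bool {
    $host = parse_url($href, PHP_URL_HOST); // null for relative URLs, false if malformed
    return is_string($host) && stripos($host, $needle) !== false;
}

// Match when the needle appears anywhere in the attribute value.
function hrefContains(string $href, string $needle): bool {
    return stripos($href, $needle) !== false;
}

$href = 'http://example.com/search?q=yahoo';
var_dump(hostContains($href, 'yahoo')); // bool(false): the host is example.com
var_dump(hrefContains($href, 'yahoo')); // bool(true): "yahoo" is in the query string
```

So the plain substring check is simpler but also catches links that merely mention "yahoo" in the path or query string, which may or may not be what you want.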

This is significantly more robust than any regex-based solution to the same problem.

cletus 2010-03-12 07:36:47

-1|If he was about to change the files permanently, he'd be better off using a powerful editor to do the job.

aefxx 2010-03-12 08:21:24

Ok, your solution looks great, but two more questions. As for performance and memory usage, how efficient would this be in comparison to a regex solution? It seems like there would be much more overhead with this option. Also, I haven't tested this yet, but it seems that in your example you are simply modifying the href attribute of the anchor and not stripping the anchor of its tags. I still don't know what the regex would be for this, but I'm thinking that a preg_replace would do the trick.

Tony 2010-03-12 19:09:26

@Tony if you're doing this as part of rendering a page, then network latency is likely to be a far bigger factor unless you're doing this on an exceptionally large document. Memory usage is basically a linear function of the size of the document, as is the processing time, so this scales well. Regexes can be more unpredictable if you get into excessive backtracking scenarios.

cletus 2010-03-13 00:47:17

@Tony also changed to remove the element.

cletus 2010-03-13 00:50:25

Thanks cletus, but I still don't think you are reading the problem correctly. I'd like to strip the tags only and leave the contents of the anchor in place, and only if the href contains yahoo. Here's another example: `<a href="http://books.yahoo.com">This Text</a>` -> This Text

Tony 2010-03-13 18:42:07

Answer 2

A:

Try this function.

function stripAnchorTags($html, $ignore_host = false, $charset = "UTF-8") {
        $dom = new DOMDocument;
        // The XML declaration prefix makes loadHTML() read the input as $charset.
        $dom->loadHTML('<?xml version="1.0" encoding="'.$charset.'"?>'.$html); // $html as a string
        $anchors = $dom->getElementsByTagName('a');
        // Walk backwards: the node list is live, so unwrapping an anchor shifts later indexes.
        for ($i = $anchors->length - 1; $i >= 0; $i--) {
            $item = $anchors->item($i);
            $href = $item->getAttribute('href');
            $host = (string) parse_url($href, PHP_URL_HOST);
            if (!$ignore_host || stripos($host, $ignore_host) === false) {
                // Move the anchor's children in front of it, then drop the empty anchor.
                while ($item->firstChild) {
                    $item->parentNode->insertBefore($item->firstChild, $item);
                }
                $item->parentNode->removeChild($item);
            }
        }
        return preg_replace('/^<!DOCTYPE.+?>/', '', str_replace( array('<html>', '</html>', '<body>', '</body>'), array('', '', '', ''), $dom->saveXML($dom->documentElement)));
    }

You can use it like this: stripAnchorTags($html);

If you want it to leave yahoo links alone, call it like this: stripAnchorTags($html, "yahoo");
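Note that the function above leaves anchors alone when their host matches $ignore_host and touches all the others. For the case in the question, where only the matching anchors should be stripped down to their contents, a minimal sketch of the opposite filter might look like this. The name unwrapAnchorsMatching() is made up for the example, and it assumes a non-empty HTML fragment whose encoding needs no special handling.

```php
<?php
// Hypothetical helper: unwrap anchors whose host contains $needle,
// keeping their contents. Assumes $html is a non-empty fragment.
function unwrapAnchorsMatching(string $html, string $needle): string {
    $dom = new DOMDocument;
    // Collect parser warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $anchors = $dom->getElementsByTagName('a');
    // Walk backwards because the node list is live.
    for ($i = $anchors->length - 1; $i >= 0; $i--) {
        $a = $anchors->item($i);
        $host = parse_url($a->getAttribute('href'), PHP_URL_HOST);
        if (!is_string($host) || stripos($host, $needle) === false) {
            continue; // leave non-matching anchors untouched
        }
        // Move the children out in front of the anchor, then remove it.
        while ($a->firstChild) {
            $a->parentNode->insertBefore($a->firstChild, $a);
        }
        $a->parentNode->removeChild($a);
    }

    // Serialize only what the parser placed inside <body>.
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return $html;
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo unwrapAnchorsMatching(
    '<p><a href="http://books.yahoo.com">This Text</a> and <a href="http://example.com">that</a></p>',
    'yahoo'
);
```

With that input, the Yahoo anchor should be reduced to its text while the example.com link stays as it is.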

AWinter 2010-09-08 03:21:48
