Does anyone know a regex function in PHP to strip an anchor of its contents, only if the anchor's href attribute contains specific text?

For example, I have an HTML page with links throughout, but I want to strip only the anchors that contain "yahoo" in the URL. So <a href="http://pages.yahoo.com/page1">Example page</a> would become: Example page, while other anchors in the HTML not containing "yahoo" would be left alone.

+2  A: 

Firstly, this isn't a regex problem (or at least it shouldn't be). PHP comes with an HTML parser so I'd strongly recommend using that.

With it, you just loop through all the anchor tags, check each href attribute, modify the document where necessary, and then save it back to HTML. For example:

$dom = new DOMDocument;
$dom->loadHTML($html); // $html as a string
$anchors = $dom->getElementsByTagName('a');
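// The node list is live, so iterate backwards: removing a node shifts later indexes.
for ($i = $anchors->length - 1; $i >= 0; $i--) {
  $item = $anchors->item($i);
  $href = $item->getAttribute('href');
  $host = parse_url($href, PHP_URL_HOST); // null when the href has no host part
  if (is_string($host) && stripos($host, 'yahoo') !== false) {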
    $item->parentNode->removeChild($item);
  }
}
$html = $dom->saveHTML();

Using parse_url() here is optional. You could simply check if the attribute value had "yahoo" anywhere in it without pulling out just the host name.

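This is more robust than any regex-based solution for the same problem, because the parser already copes with variations in attribute order, quoting and whitespace that a pattern would have to anticipate.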
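To illustrate the point about parse_url() being optional, here is a minimal sketch comparing the two checks. The helper names hostMentions() and hrefMentions() and the sample URLs are hypothetical, made up for this example; they are not part of PHP or of the answer above.

```php
<?php
// Hypothetical helpers for illustration only; the names are assumptions.

// True when the host part of $href contains $needle (case-insensitive).
function hostMentions(string $href, string $needle): bool
{
    $host = parse_url($href, PHP_URL_HOST); // null when there is no host
    return is_string($host) && stripos($host, $needle) !== false;
}

// True when $needle appears anywhere in the raw attribute value.
function hrefMentions(string $href, string $needle): bool
{
    return stripos($href, $needle) !== false;
}

// Made-up sample URLs: the second mentions "yahoo" only in its query string.
$samples = [
    'http://pages.yahoo.com/page1',
    'http://example.com/?ref=yahoo',
];
foreach ($samples as $href) {
    printf(
        "%s host:%s href:%s\n",
        $href,
        var_export(hostMentions($href, 'yahoo'), true),
        var_export(hrefMentions($href, 'yahoo'), true)
    );
}
```

The first URL matches both checks, while the second matches only the whole-value check, so which one to use depends on whether "contains yahoo" is meant to apply to the host or to the entire URL.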

cletus
-1|If he was about to change the files permanently, he'd be better off using a powerful editor to do the job.
aefxx
OK, your solution looks great, but I have two more questions. As for performance and memory usage, how efficient would this be in comparison to a regex solution? It seems like there would be much more overhead with this option. Also, I haven't tested this yet, but it seems that in your example you are simply modifying the href attribute of the anchor and not stripping the anchor of its tags. I still don't know what the regex would be for this, but I'm thinking that a preg_replace would do the trick.
Tony
@Tony if you're doing this as part of rendering a page, then network latency is likely to be a far bigger factor unless you're doing this on an exceptionally large document. Memory usage is basically a linear function of the size of the document, as is the processing time, so this scales well. Regexes can be more unpredictable if you get into excessive backtracking scenarios.
cletus
@Tony also changed to remove the element.
cletus
Thanks cletus, but I still don't think you are reading the problem correctly. I'd like to strip the tags only and leave the contents of the anchor in place, and only if the href contains yahoo. Here's another example: `<a href="http://books.yahoo.com">This Text</a>` -> This Text
Tony
A: 

Try this function.

public function stripAnchorTags($html, $ignore_host = false, $charset="UTF-8"){
        $dom = new DOMDocument;
        $dom->loadHTML('<?xml version="1.0" encoding="'.$charset.'"?>'.$html); // $html as a string
        $anchors = $dom->getElementsByTagName('a');
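        // The node list is live: walk it backwards so that replacing a node does not
        // shift the items still to be visited, and a skipped anchor cannot stall the loop.
        for($i = $anchors->length - 1; $i >= 0; $i--){
            $item = $anchors->item($i);
            $href = $item->getAttribute('href');
            $host = parse_url($href, PHP_URL_HOST);
            // Cast because parse_url() returns null when the href has no host part.
            if(!$ignore_host || stripos((string) $host, $ignore_host) === false) {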
                $item->parentNode->replaceChild($dom->createTextNode($href),$item);
            }
        }
        return preg_replace('/^<!DOCTYPE.+?>/', '', str_replace( array('<html>', '</html>', '<body>', '</body>'), array('', '', '', ''), $dom->saveXML($dom->documentElement)));
    }

You can use it like this: stripAnchorTags($html);

If you want it to leave yahoo links untouched, call it like this: stripAnchorTags($html, "yahoo");

AWinter
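
For the behaviour asked for in the comments above (drop the <a> tags but keep their contents, and only when the href host contains "yahoo"), a minimal sketch along the same DOM lines might look like this. The function name unwrapAnchorsByHost() and its parameters are hypothetical, and the sketch assumes the input is UTF-8; it is one possible approach, not code from either answer.

```php
<?php
// Hypothetical helper for illustration; name and signature are assumptions.
function unwrapAnchorsByHost(string $html, string $needle): string
{
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $dom = new DOMDocument;
    // The XML declaration prefix is a common hint so the input is read as UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8"?>' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $anchors = $dom->getElementsByTagName('a');
    // The node list is live, so walk it backwards while removing nodes.
    for ($i = $anchors->length - 1; $i >= 0; $i--) {
        $a = $anchors->item($i);
        $host = parse_url($a->getAttribute('href'), PHP_URL_HOST);
        if (!is_string($host) || stripos($host, $needle) === false) {
            continue; // leave non-matching anchors untouched
        }
        // Move every child in front of the anchor, then drop the empty anchor.
        while ($a->firstChild !== null) {
            $a->parentNode->insertBefore($a->firstChild, $a);
        }
        $a->parentNode->removeChild($a);
    }

    // Serialise only the children of the body element the parser created.
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo unwrapAnchorsByHost(
    '<p><a href="http://books.yahoo.com">This Text</a> and <a href="http://example.com">other</a></p>',
    'yahoo'
);
```

Moving the child nodes rather than copying the text means any markup inside the link, such as <b> or <img>, is kept as well.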