I'd like to take a string of text, find all of the hyperlink tags, grab each href value, and replace the entire hyperlink tag with the value of its href attribute.

+5  A: 

There are many possibilities, e.g. using the DOM extension with DOMDocument::loadHTML() and XPath (though getElementsByTagName() would suffice in this case).

<?php
$string = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd"><html><head><title>...</title></head><body>
  <p>
    mary had a <a href="little">greedy</a> lamb
    whose fleece was <a href="white">cold</a> as snow
  </p>
</body></html>';

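$doc = new DOMDocument;
$doc->loadHTML($string);

// DOMXPath::query() returns a static node list, so replacing nodes while iterating is safe
$xpath = new DOMXPath($doc);
foreach( $xpath->query('//a') as $a ) {
  // swap each <a> element for a text node holding its href value
  $tn = $doc->createTextNode($a->getAttribute('href'));
  $a->parentNode->replaceChild($tn, $a);
}

echo $doc->saveHTML();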

prints

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head><title>...</title></head>
<body><p>
    mary had a little lamb
    whose fleece was white as snow
  </p></body>
</html>
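
For comparison, here is a minimal sketch of the getElementsByTagName() variant mentioned above. The input fragment and variable names are made up for illustration. One thing to watch: unlike the XPath result, the list returned by getElementsByTagName() is live, so this sketch copies it with iterator_to_array() before replacing any nodes.

```php
<?php
// Illustrative input, not taken from the question.
$html = '<p>mary had a <a href="little">greedy</a> lamb '
      . 'whose fleece was <a href="white">cold</a> as snow</p>';

$doc = new DOMDocument;
$doc->loadHTML($html);

// Snapshot the live node list so that replacing nodes does not disturb the loop.
$links = iterator_to_array($doc->getElementsByTagName('a'));

foreach ($links as $a) {
    // Replace each <a> element with a text node containing its href value.
    $text = $doc->createTextNode($a->getAttribute('href'));
    $a->parentNode->replaceChild($text, $a);
}

// Read back the text of the paragraph to check the result.
$result = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $result, "\n";
```

Iterating the live list directly while replacing nodes tends to skip elements, which is why the copy is taken first.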
VolkerK
+1 for providing a non-regex solution, as first answer.
OregonGhost
@OregonGhost - Doing it with regex would probably not be the best idea anyway... +1 from me as well
Buggabill
hsatterwhite
@hsatterwhite: see http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html Regular expressions may be _part_ of the parser but regular expressions alone can't do the job (for arbitrary html)
VolkerK
Wow, that's an excellent read! I'd up vote, but I'm too new. ;)
hsatterwhite
Having said that, if you can make certain assumptions about the string (its structure and the characters it can contain), it might be possible to use regular expressions. And Jeriko has a point in saying "at least regex won't break if the HTML isn't well-formed". Take a close look at the output of my example: even in this case the underlying libxml made some small changes (here, where the line breaks are placed). The output is "only" equivalent. But in general you will (and should) get raised eyebrows when you use regular expressions to parse HTML.
VolkerK
Yeah, I noticed Jeriko's comment, and I think your suggested method will work fine since I am using a WYSIWYG editor that creates relatively good markup for these hyperlinks. I'm going to have to read more into your solution, considering I've never even heard of it. Is it painfully obvious that I'm a bit of a novice? ;)
hsatterwhite
You can use an HTML parser like BeautifulSoup, which will also not break if the HTML isn't well-formed. However, to learn why regex (in general) is not a good idea for parsing HTML, look here: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
OregonGhost
A: 

Sounds like you're looking for regular expressions... I'm terrible at them, but you can take a look at the PHP documentation for preg_replace.

Basically you need to match <a href="$1">$2</a> and replace it with just $1
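
A rough sketch of that idea with preg_replace, assuming the href values are always quoted and the markup is as regular as a WYSIWYG editor would produce. The function name replaceLinksWithHref is hypothetical, and the pattern is only an illustration: it does not handle unquoted attribute values, attributes such as data-href, or nested markup reliably.

```php
<?php
// Hypothetical helper; only handles single- or double-quoted href values.
function replaceLinksWithHref(string $html): string
{
    // (["\']) captures the opening quote, (.*?) the href value,
    // and \1 requires the same quote character to close it.
    // Flags: i = case-insensitive, s = let . match newlines in the link text.
    $pattern = '~<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1[^>]*>.*?</a>~is';

    // preg_replace() returns null on a regex error; fall back to the input then.
    return preg_replace($pattern, '$2', $html) ?? $html;
}

$input  = 'mary had a <a href="little">greedy</a> lamb '
        . 'whose fleece was <A HREF=\'white\'>cold</A> as snow';
$output = replaceLinksWithHref($input);
echo $output, "\n";
```

Note that the href value ends up in the second capture group here, because the first group is used to remember which quote character opened the attribute.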

Just a pointer in the right direction :)

Jeriko
Regex for HTML parsing is generally not a good idea...
Buggabill
Sure, but I've had trouble with XPath and such when there are broken tags; at least regex won't break if the HTML isn't well-formed. Just my personal experience.
Jeriko
As I said in a comment to another answer, you can use a forgiving HTML parser like BeautifulSoup (I don't know if that's available for PHP, but I guess there is something like it for PHP somewhere) rather than a regex. I didn't downvote this answer because if you really, really know what your input will look like, a regex may be the easiest solution.
OregonGhost
Why do you need to be so certain of what your input's going to look like? Surely you're guaranteed to have <a *href="*"*>*</a>? I agree that a parser might be much easier, but, as I said, I've had some site-breaking problems with unforgiving parsers and client-generated content, so I'm once bitten, twice shy on the whole matter :)
Jeriko
@Jeriko: You'll find that a lot of HTML code out in the wild does not use double quotes for attribute values, and that it is written with different casing. And that's just a very simple case, if you need to parse other tags, it gets really complicated. As already said, you should use a forgiving parser if you are not *sure* if there will be non-compliant input. If you are sure, well, then maybe a regex will be the easiest solution anyway ;)
OregonGhost
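
To illustrate the points about broken tags, quoting, and casing raised in this thread, here is a small sketch using PHP's DOM extension on deliberately sloppy input. The input string is invented for the example. It assumes that DOMDocument::loadHTML() (backed by libxml's HTML parser) tolerates this kind of markup, and uses libxml_use_internal_errors() so that parse warnings are collected instead of being emitted.

```php
<?php
// Invented input: unclosed <p>, uppercase tag and attribute names,
// a single-quoted attribute value and an unquoted one.
$html = "<p>see <A HREF='one'>first</A> and <a href=two>second</a>";

// Collect libxml parse warnings internally rather than printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument;
$doc->loadHTML($html);

// Discard whatever warnings were collected during parsing.
libxml_clear_errors();

// Gather the href value of every link the parser recognised.
$hrefs = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $hrefs[] = $a->getAttribute('href');
}

print_r($hrefs);
```

Whether a parser like this is forgiving enough for a given source of client-generated content is something to test against real samples.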