ansaurus

Question

How can you find all HTML hyper link tags in a string and replace them with their href value?

Answer 1

+5 A:

There are many possibilities, e.g. using the DOM extension with DOMDocument::loadHTML() and XPath (though getElementsByTagName() would suffice in this case).

<?php
$string = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd"><html><head><title>...</title></head><body>
  <p>
    mary had a <a href="little">greedy</a> lamb
    whose fleece was <a href="white">cold</a> as snow
  </p>
</body></html>';

$doc = new DOMDocument;
$doc->loadHTML($string);

$xpath = new DOMXPath($doc);
foreach( $xpath->query('//a') as $a ) {
  $tn = $doc->createTextNode($a->getAttribute('href'));
  $a->parentNode->replaceChild($tn, $a);
}

echo $doc->saveHTML();

prints

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head><title>...</title></head>
<body><p>
    mary had a little lamb
    whose fleece was white as snow
  </p></body>
</html>
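
The getElementsByTagName() variant mentioned at the top could look roughly like the sketch below. It is only an illustration under a few assumptions: the input is shortened to a single paragraph, and the node list is copied with iterator_to_array() first, because getElementsByTagName() returns a live list that changes while nodes are being replaced.

```php
<?php
// Sketch: same replacement as above, but with getElementsByTagName() instead of XPath.
// The shortened input string is an assumption made for this example.
$string = '<p>mary had a <a href="little">greedy</a> lamb whose fleece was <a href="white">cold</a> as snow</p>';

$doc = new DOMDocument;
$doc->loadHTML($string);

// getElementsByTagName() returns a live list, so copy it to a plain array
// before replacing nodes; otherwise the list shrinks while we iterate.
$links = iterator_to_array($doc->getElementsByTagName('a'));

foreach ($links as $a) {
    // Replace the whole <a> element with a text node holding its href value.
    $text = $doc->createTextNode($a->getAttribute('href'));
    $a->parentNode->replaceChild($text, $a);
}

// Serialize only the <p> element rather than the whole generated document.
$result = $doc->saveHTML($doc->getElementsByTagName('p')->item(0));
echo $result, "\n";
```

The XPath version does not need the copy step because DOMXPath::query() returns a static list.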

VolkerK 2010-04-15 12:29:27

+1 for providing a non-regex solution, as first answer.

OregonGhost 2010-04-15 12:30:26

@OregonGhost - Doing it with regex would probably not be the best idea anyway... +1 from me as well

Buggabill 2010-04-15 12:35:32

hsatterwhite 2010-04-15 13:25:11

@hsatterwhite: see http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html Regular expressions may be _part_ of the parser but regular expressions alone can't do the job (for arbitrary html)

VolkerK 2010-04-15 13:47:50

Wow, that's an excellent read! I'd up vote, but I'm too new. ;)

hsatterwhite 2010-04-15 13:54:36

Having said that, if you can make certain assumptions about the string (its structure and the characters it can contain), it might be possible to use regular expressions. And Jeriko has a point in saying "at least regex won't break if the HTML isn't well-formed". Take a close look at the output of my example. Even in this case the underlying libxml made some small changes (here, where the line breaks are placed). The output is "only" equivalent. But in general you will (and should) get raised eyebrows when you use regular expressions to parse HTML.

VolkerK 2010-04-15 13:56:39

Yeah, I noticed Jeriko's comment, and I think your suggested method will work fine since I am using a WYSIWYG editor that creates relatively good markup for these hyperlinks. I'm going to have to read more into your solution, considering I've never even heard of it. Is it painfully obvious that I'm a bit of a novice? ;)

hsatterwhite 2010-04-15 14:15:48

You can use a HTML parser like BeautifulSoup, which will also not break if the HTML isn't well-formed. However, to learn why regex (in general) is not a good idea for parsing HTML, look here: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

OregonGhost 2010-04-15 15:42:04

Answer 2

A:

Sounds like you're looking for regular expressions... I'm terrible at them, but you can take a look at the PHP documentation for preg_replace.

Basically you need to match <a href="$1">$2</a> and replace it with just $1

Just a pointer in the right direction :)
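
For what it's worth, the preg_replace idea above might be sketched like this. It assumes simple, predictable anchors with a quoted href value and no nested tags; the pattern and the sample string are made up for this example and will not cope with arbitrary HTML (unquoted values, for instance, are not matched).

```php
<?php
// Sketch only: assumes every link has a quoted href value (single or double quotes).
$string = 'mary had a <a href="little">greedy</a> lamb whose fleece was <A class="x" HREF=\'white\'>cold</A> as snow';

// <a ... href="value" ...>anything</a>
//   i = case-insensitive, s = let "." match line breaks
//   group 1 = the quote character, group 2 = the href value
$pattern = '~<a\b[^>]*?\bhref\s*=\s*(["\'])(.*?)\1[^>]*>.*?</a>~is';

// Keep only the captured href value (group 2).
$result = preg_replace($pattern, '$2', $string);
echo $result, "\n";
```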

Jeriko 2010-04-15 12:31:49

Regex for HTML parsing is generally not a good idea...

Buggabill 2010-04-15 12:36:16

Sure, but I've had trouble with XPath and such when there are broken tags - at least regex won't break if the HTML isn't well-formed.. Just my personal experience

Jeriko 2010-04-15 12:44:02

As I said in a comment to another answer, you can use a forgiving HTML parser like BeautifulSoup (I don't know if that's available for PHP, but I guess there is something like it for PHP somewhere) rather than a regex. I didn't downvote this answer because if you really, really know what your input will look like, a regex may be the easiest solution.

OregonGhost 2010-04-15 15:45:22

Why do you need to be so certain of what your input's going to look like? Surely you're guaranteed to have <a *href="*"*>*</a>? I agree that a parser might be easier, but, as I said, I've had some site-breaking problems with unforgiving parsers and client-generated content, so I'm once bitten, twice shy on the whole matter :)

Jeriko 2010-04-15 15:52:02

@Jeriko: You'll find that a lot of HTML code out in the wild does not use double quotes for attribute values, and that it is written with different casing. And that's just a very simple case, if you need to parse other tags, it gets really complicated. As already said, you should use a forgiving parser if you are not *sure* if there will be non-compliant input. If you are sure, well, then maybe a regex will be the easiest solution anyway ;)

OregonGhost 2010-04-16 16:56:29
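
As a rough illustration of the "forgiving parser" point in PHP: DOMDocument::loadHTML() is built on libxml's HTML parser, which generally recovers from sloppy markup and reports problems as warnings. The sketch below assumes a made-up input with an upper-case tag, an unquoted attribute value, a single-quoted value and a missing </p>; libxml_use_internal_errors() is used to keep the parser warnings out of the output.

```php
<?php
// Sketch: sloppy markup (upper-case tag, unquoted and single-quoted hrefs, missing </p>).
// The input string is an assumption made for this example.
$string = "<p>mary had a <A HREF=little>greedy</A> lamb whose fleece was <a href='white'>cold</a> as snow";

// Collect parser warnings internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument;
$doc->loadHTML($string);

// Discard the collected warnings and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The parser normalizes tag and attribute names to lower case,
// so a plain //a[@href] query finds both links.
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//a[@href]') as $a) {
    $a->parentNode->replaceChild($doc->createTextNode($a->getAttribute('href')), $a);
}

$result = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $result, "\n";
```

How much recovery you get depends on the libxml version and on how badly broken the input is, so treat this as a starting point rather than a guarantee.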
