I have a function that replaces the href attribute of anchors in a string using PHP's DOMDocument. Here's a snippet:

$doc     = new DOMDocument('1.0', 'UTF-8');
$doc->loadHTML($text);
$anchors    = $doc->getElementsByTagName('a');

foreach($anchors as $a) {
    $a->setAttribute('href', 'http://google.com');
}

return $doc->saveHTML();

The problem is that loadHTML($text) surrounds the $text in doctype, html, body, etc. tags. I tried working around this by doing this instead of loadHTML():

$doc     = new DOMDocument('1.0', 'UTF-8');
$node    = $doc->createTextNode($text);
$doc->appendChild($node);
...

Unfortunately, this encodes all the entities (anchors included). Does anyone know how to turn this off? I've already thoroughly looked through the docs and tried hacking it, but can't figure it out.

Thanks! :)

+1  A: 

XML has only a few predefined entities. All your HTML entities are defined somewhere else. When you use loadHTML(), these entity definitions are loaded automagically; with loadXML() (or no load() at all) they are not.
createTextNode() does exactly what the name suggests. Everything you pass as value is treated as text content, not as markup. I.e. if you pass something that has a special meaning to the markup (<, >, ...) it's encoded in a way a parser can distinguish the text from the actual markup (&lt;, &gt;, ...)
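
To make the escaping concrete, here is a minimal sketch. The sample string and the wrapping p element are assumptions for illustration only; the point is that the anchor passed to createTextNode() comes back as escaped text, not as an element.

```php
<?php
// Sketch: createTextNode() stores its argument as character data,
// so markup characters are escaped when the node is serialized.
$doc = new DOMDocument('1.0', 'UTF-8');
$p   = $doc->createElement('p'); // illustrative container element
$p->appendChild($doc->createTextNode('You can <a href="#">click here</a>.'));
$doc->appendChild($p);

// Serialize just the <p> node; the anchor appears as &lt;a ...&gt; text.
$asText = $doc->saveXML($p);

// No <a> element was ever created in the tree.
$anchorCount = $doc->getElementsByTagName('a')->length;
```

Here $anchorCount is 0, which is why the href replacement loop in the question finds nothing to change after createTextNode().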

Where does $text come from? Can't you do the replacement within the actual html document?

VolkerK
With loadHTML, no entity translation occurs. I ended up hacking around the problem in a tenuous way by running mb_substr($text, 122, -19); on the result from $doc->saveHTML(). Yikes! :) $text is a translated string with placeholder anchor tags, so the replacement has to be done at run time. I'd rather not parse the entire document, as it would be difficult to pick out only the translated links. Good idea though.
thesmart
A: 

I ended up hacking this in a tenuous way, changing:

return $doc->saveHTML();

into:

$text    = $doc->saveHTML();
return mb_substr($text, 122, -19);

This cuts out all the unnecessary garbage, changing this:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" 
"http://www.w3.org/TR/REC-html40/loose.dtd"> <html><body><p>
You can <a href="http://www.google.com">click here</a> to visit Google.</p>
</body></html>

into this:

You can <a href="http://www.google.com">click here</a> to visit Google.
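
One possible way to avoid the hard-coded offsets is to wrap the input in a known container and serialize only that container's children. This is a rough sketch, not a tested drop-in: the function name replaceHrefs and the wrapper div are made up for illustration, passing a node to saveHTML() needs PHP 5.3.6 or later, and character-encoding handling for non-ASCII translated text is not addressed here.

```php
<?php
// Hypothetical helper: rewrite every anchor's href and return only the
// fragment, without the doctype/html/body wrapper that loadHTML() adds.
function replaceHrefs($text, $href)
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    // Wrap the fragment in a container we control (an assumption of this
    // sketch), so we do not depend on the implied <p> or on string offsets.
    $doc->loadHTML('<div>' . $text . '</div>');

    foreach ($doc->getElementsByTagName('a') as $a) {
        $a->setAttribute('href', $href);
    }

    // The wrapper is the first <div> in document order.
    $wrap = $doc->getElementsByTagName('div')->item(0);

    // Serialize each child of the wrapper (requires PHP >= 5.3.6).
    $out = '';
    foreach ($wrap->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$result = replaceHrefs(
    'You can <a href="http://www.google.com">click here</a> to visit Google.',
    'http://google.com'
);
```

Unlike mb_substr($text, 122, -19), this does not break if the doctype string or the closing tags emitted by libxml change length.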

Can anyone figure out something better?

thesmart
+1  A: 
$text is a translated string with place-holder anchor tags

If these placeholders have a strict, well-defined format, a simple preg_replace or preg_replace_callback might do the trick.
I don't suggest fiddling with HTML documents using regex in general, but for a small, well-defined subset it is suitable.
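
As a rough sketch of the preg_replace_callback route: the function name fillAnchorHrefs, the pattern, and the fallback to "#" are all assumptions for illustration. It assumes the placeholders are plain opening a tags with no ">" inside attribute values, and it uses a closure, so it needs PHP 5.3 or later.

```php
<?php
// Hypothetical helper: give the n-th opening <a ...> tag the n-th URL.
function fillAnchorHrefs($text, array $hrefs)
{
    $i = 0;
    return preg_replace_callback(
        // Matches <a> or <a ...attributes...>; assumes no '>' in attribute values.
        '/<a\b[^>]*>/i',
        function ($m) use (&$i, $hrefs) {
            // Fall back to "#" when there are more anchors than URLs.
            $href = isset($hrefs[$i]) ? $hrefs[$i] : '#';
            $i++;
            // Escape the URL so it is safe inside a double-quoted attribute.
            return '<a href="' . htmlspecialchars($href, ENT_QUOTES, 'UTF-8') . '">';
        },
        $text
    );
}

$filled = fillAnchorHrefs(
    'Click <a>here</a> or <a href="x">there</a>.',
    array('http://a.example/?x=1&y=2')
);
// $filled is:
// Click <a href="http://a.example/?x=1&amp;y=2">here</a> or <a href="#">there</a>.
```

Note that this replaces the whole opening tag, so any other attributes on the placeholder are dropped.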

VolkerK
A: 

OK, here's the final solution I ended up with. Decided to go with VolkerK's suggestion.

public static function ReplaceAnchors($text, array $attributeSets)
{
 $expression = '/(<a)([\s\w\d:\/=_&\[\]\+%".?])*(>)/';

 if (empty($attributeSets) || !is_array($attributeSets)) {
  // no attributes to set. Set href="#".
  return preg_replace($expression, '$1 href="#"$3', $text);
 }

 $attributeStrs = array();
 foreach ($attributeSets as $attributeKeyVal) {
  // loop thru attributes and set the anchor
  $attributePairs = array();
  foreach ($attributeKeyVal as $name => $value) {
   if (!is_string($value) && !is_int($value)) {
    continue; // skip
   }

   $name    = htmlspecialchars($name);
   $value    = htmlspecialchars($value);
   $attributePairs[] = "$name=\"$value\"";
  }
  $attributeStrs[] = implode(' ', $attributePairs);
 }

 $i  = -1;
 $pieces = preg_split($expression, $text);
 foreach ($pieces as &$piece) {
  if ($i === -1) {
   // skip the first token
   ++$i;
   continue;
  }

  // figure out which attribute string to use
  if (isset($attributeStrs[$i])) {
   // pick the parallel attribute string
   $attributeStr = $attributeStrs[$i];
  } else {
   // pick the last attribute string if we don't have enough
   $attributeStr = $attributeStrs[count($attributeStrs) - 1];
  }

  // build a new opening anchor for this token.
  $piece = '<a '.$attributeStr.'>'.preg_replace($expression, '$1 href="#"$3', $piece);
  ++$i;
 }
 unset($piece); // break the reference left over from the foreach

 return implode('', $pieces);
}

This allows one to call the function with a set of different anchor attributes.

thesmart