Hi, I am working on this PHP function. The idea is to wrap certain words occurring in a string in certain tags (both the words and the tags are given in an array). It works OK, but when those words occur inside linked text or in the link's 'href' attribute, the link is of course broken and stuffed with tags, or tags that should not be inside a link are generated. This is what I have now:

function replace() {
  $terminos = array(
    "beneficios" => "h3",
    "valoracion" => "h2",
    "empresarios" => "h2",
    "tecnologias" => "h2",
    "...and so on..." => "etc",
  );

  $body = "string where the word empresarios should be replaced; but the word <a href='http://www.empresarios.com'>empresarios</a> should not be replaced inside <a> tags nor in the URL of their 'src' attribute.";

  // Wrap every occurrence of each term in its tag, accumulating the result in $body.
  // Problem: str_replace() also matches inside <a> elements and attribute values.
  foreach ($terminos as $key => $value) {
    $tagged = "<".$value.">".$key."</".$value.">";
    $body = str_replace($key, $tagged, $body);
  }

  return $body;
}

The function, in this example, should return "string where the word <h2>empresarios</h2> should be replaced; but the word <a href='http://www.empresarios.com'>empresarios</a> should not be replaced inside <a> tags nor in the URL of their 'src' attribute."

I'd like this replacement function to work throughout the string, but not inside tags or their attributes!

(I'd like to do what is described in the following thread, except that I need it in PHP rather than JavaScript: /questions/1666790/how-to-replace-text-not-within-a-specific-tag-in-javascript)

A: 

It's basically the same as the JS answer you pointed to. You just have to write the pattern as a string:

$regexp = '/(<pre>(?:[^<](?!\/pre))*<\/pre>)|(\:\-\))/i'; // PCRE has no "g" modifier; preg_replace() already replaces every match

Also note that you may need a case-insensitive match (the i modifier with preg_replace) to catch the word 'empresarios' when it's capitalized (Empresarios) or written in mixed case (EmPreSAriOS).
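To make the idea concrete for the question's case, here is a minimal sketch of the same alternation trick in PHP. It is not the code from the linked thread: the function name wrap_terms_outside_links is hypothetical, and it assumes lowercase ASCII terms and links that are not nested. The pattern matches whole <a>...</a> elements and any other tag first and returns them unchanged, so only terms found outside tags get wrapped.

```php
<?php
// Hypothetical helper (not from the linked thread): wrap each term in its tag,
// leaving <a>...</a> elements and all other tags untouched.
function wrap_terms_outside_links(string $body, array $terms): string
{
    // Escape the terms so they are safe to embed in the pattern.
    $words = implode('|', array_map(function ($w) {
        return preg_quote($w, '/');
    }, array_keys($terms)));

    // Alternative 1: a whole <a>...</a> element.
    // Alternative 2: any other single tag.
    // Alternative 3 (capture group 1): one of the terms as a whole word.
    $regexp = '/<a\b[^>]*>.*?<\/a>|<[^>]+>|\b(' . $words . ')\b/is';

    $result = preg_replace_callback($regexp, function (array $m) use ($terms) {
        if (!isset($m[1]) || $m[1] === '') {
            return $m[0]; // a link or a tag: give it back unchanged
        }
        // Assumption: array keys are lowercase ASCII, so strtolower() finds them.
        $tag = $terms[strtolower($m[1])];
        return '<' . $tag . '>' . $m[1] . '</' . $tag . '>';
    }, $body);

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $body;
}

$terminos = array('empresarios' => 'h2', 'beneficios' => 'h3');
echo wrap_terms_outside_links(
    "la palabra empresarios y <a href='http://www.empresarios.com'>empresarios</a>",
    $terminos
), "\n";
```

Regular expressions cannot handle every piece of HTML (comments, scripts, a '>' inside an attribute value), so this only suits simple, well-formed fragments.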

Also take care with your HTML. <h2> elements are block-level, so this text:

string where the word empresarios should be replaced;

will be rendered like this after the replacement:

string where the word

empresarios

should be replaced;

Maybe what you need is an inline element such as a <big> tag.

metrobalderas
Thanks metrobalderas, I really did try your solution, but when it comes to regexp I am such a newbie that I don't really understand what I am doing. Thanks anyway, your answer will probably be useful for others!
Alextronic
+2  A: 

Use the DOM and only modify text nodes:

$s = "foo <a href='http://test.com'>foo</a> lorem bar ipsum foo. <a>bar</a> not a test";
echo htmlentities($s) . '<hr>';

$d = new DOMDocument;
$d->loadHTML($s);

$x = new DOMXPath($d);
$t = $x->evaluate("//text()");

$wrap = array(
    'foo' => 'h1',
    'bar' => 'h2'
);

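// Escape each key so that regex metacharacters in a search term cannot break the pattern.
$quoted = array_map(function ($w) { return preg_quote($w, '/'); }, array_keys($wrap));
$preg_find = '/\b(' . implode('|', $quoted) . ')\b/';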

foreach($t as $textNode) {
    if( $textNode->parentNode->tagName == "a" ) {
        continue;
    }

    // -1 means "no limit"; passing null here is deprecated as of PHP 8.1.
    $sections = preg_split( $preg_find, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);

    $parentNode = $textNode->parentNode;

    foreach($sections as $section) {  
        if( !isset($wrap[$section]) ) {
            $parentNode->insertBefore( $d->createTextNode($section), $textNode );
            continue;
        }

        $tagName = $wrap[$section];
        $parentNode->insertBefore( $d->createElement( $tagName, $section ), $textNode );
    }

    $parentNode->removeChild( $textNode );
}

echo htmlentities($d->saveHTML());

Edited to replace DOMText with DOMText and DOMElement as necessary.

Adam Backstrom
Hi Adam, thank you! That was helpful. I only needed to call @$d->loadHTML($remote); to get rid of the invalid markup messages. However, the problem now is that the parsed output contains ASCII characters, so I get a load of them, and the tags are also visible in the output of $d->saveHTML();... How can we get rid of that?!
Alextronic
I've again modified the code to search through each text node and replace matched strings with DOMElement objects.
Adam Backstrom
that's great Adam, let me check it! thanks
Alextronic
Well Adam, it DOES work!! The only problem now is that I am working with Spanish texts. This is what I get when there are characters such as á é í ó ú ñ...: "La ética profesional y el reducido tamaño de nuestros equipos garantizan la más estricta confidencialidad y discreción de nuestros proyectos." I have been trying different functions to encode this correctly, but to no avail. Will post when I find a solution. THANKS!
Alextronic
I also tried to precede $s with '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />', but it didn't work.
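One possible way around the accented-character problem described in the comments above, offered as a sketch rather than a confirmed fix for this thread: DOMDocument::loadHTML() treats input with no declared charset as ISO-8859-1, so UTF-8 text gets mangled on the way in. The hypothetical helper load_utf8_html below assumes the input is valid UTF-8 and that the mbstring extension is available. It converts every non-ASCII character to a numeric entity before parsing, so the parser only ever sees ASCII.

```php
<?php
// Hypothetical helper: parse an HTML fragment whose text is UTF-8.
// Assumptions: $html is valid UTF-8 and the mbstring extension is loaded.
function load_utf8_html(string $html): DOMDocument
{
    // Turn every character from U+0080 upwards into a numeric entity (é becomes &#233;).
    $ascii = mb_encode_numericentity($html, array(0x80, 0x10FFFF, 0, 0x1FFFFF), 'UTF-8');

    $d = new DOMDocument();
    // Collect invalid-markup warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $d->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $d;
}

$d = load_utf8_html('<p>La ética profesional y el reducido tamaño</p>');
$p = $d->getElementsByTagName('p')->item(0);
// Text read back from the DOM is UTF-8 again.
echo $p->textContent, "\n";
```

When printing the final markup, echo $d->saveHTML() directly. Wrapping it in htmlentities(), as the demo code does for display purposes, escapes the markup itself, which is why tags and entities show up as visible text.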
Alextronic