Hi there. I am using preg_replace to add a link to keywords if they are found within a long HTML string. I don't want to add a link if the keyword is found within h1 tags or strong tags.

The regex below nearly works and basically says (I think): if the keyword is not immediately wrapped by either an h1 tag or a strong tag, replace it with the matched keyword as a bolded link to Google.

$result = preg_replace('%(?!<h1>)(?!<strong>)\b(bobs widgets)\b(?!<\/strong>)(?!<\/h1>)%i','<a href="http://www.google.com"><strong>$1</strong></a>', $result, -1);

(The reason I don't want to match inside strong tags is that I am looping through a lot of keywords, so I don't want to link an already linked keyword on subsequent passes.)

The above works fine and won't match:

<h1>bobs widgets</h1>

It will however match the keyword in the following text, because the h1 tag isn't immediately either side of the keyword:

<h1>Here are bobs widgets for sale</h1>

I need to make the spaces either side optional and have tried adding \s* but that doesn't get me anywhere. I'd be very grateful for a push in the right direction here.

+2  A: 

Regular expressions are the wrong tool for this job. This has been discussed many times on Stack Overflow (such as the most famous thread on the site).

What you need is an HTML parser, such as the Simple HTML DOM Parser. Do yourself a favour and use something like this from the start. Imagine what's going to happen when you run into an <h1> where someone has added an attribute, or perhaps someone has improperly closed the tags, so you have a mixed up order on a </strong> and a </h1>. Getting things like that to work with a regular expression is not worth the trouble, and sometimes isn't even possible.
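As a rough illustration of the parser approach, here is a minimal sketch. It uses PHP's bundled DOMDocument and DOMXPath classes rather than the Simple HTML DOM Parser mentioned above, and the function name `linkKeyword`, the `wrap` container id and the sample input are all made up for the example. It assumes plain ASCII input and leaves character-encoding handling out.

```php
<?php
// Sketch only: link a keyword in text nodes, skipping any text that sits
// inside an <h1>, <strong> or <a> element. linkKeyword() is a hypothetical name.
function linkKeyword(string $html, string $keyword, string $url): string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment in a known container so it can be found again later.
    $doc->loadHTML('<div id="wrap">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $query = '//div[@id="wrap"]//text()'
           . '[not(ancestor::h1 or ancestor::strong or ancestor::a)]';
    $pattern = '/\b(' . preg_quote($keyword, '/') . ')\b/i';

    // Copy the node list first, because the loop below changes the tree.
    foreach (iterator_to_array($xpath->query($query)) as $text) {
        // With one capture group, matched keywords land at odd indexes.
        $parts = preg_split($pattern, $text->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // keyword not present in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $strong = $doc->createElement('strong');
                $strong->appendChild($doc->createTextNode($part));
                $a = $doc->createElement('a');
                $a->setAttribute('href', $url);
                $a->appendChild($strong);
                $fragment->appendChild($a);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $text->parentNode->replaceChild($fragment, $text);
    }

    // Serialize only the children of the wrapper, not the added html/body.
    $wrap = $xpath->query('//div[@id="wrap"]')->item(0);
    $out = '';
    foreach ($wrap->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$html = '<h1>Here are bobs widgets for sale</h1><p>Buy bobs widgets today</p>';
echo linkKeyword($html, 'bobs widgets', 'http://www.google.com');
```

Because the ancestor check covers the whole chain of parent elements, the heading is skipped even though the h1 tags are not right next to the keyword, and running the function again over its own output should leave the existing link alone.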

zombat
Thanks for the reply zombat, I'll certainly look into the HTML parser you mentioned. I should have said that the HTML being processed isn't user-supplied; it comes from another part of the script, which takes plain text and wraps it in <h1> or <p> tags depending on its length. With that in mind, do you think there is a solution?
James
Should be okay as long as it's valid HTML, i.e. doesn't have '<' or various other characters that HTML has special feelings for in inappropriate places (like the back of a Volkswagen). Once you learn to use the Simple HTML library (and this is not a large time investment) it's so much easier than dealing with regexes that there's no point in not using it, unless you're running this code on a pocket calculator.
intuited
So in other words, you should escape any of those characters by running the text through htmlspecialchars(): http://ca3.php.net/manual/en/function.htmlspecialchars.php
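A small sketch of what that could look like at the point where the plain text gets wrapped. The helper name `wrapPlainText` and the sample string are invented for illustration; only `htmlspecialchars()` itself comes from the comment above.

```php
<?php
// Hypothetical helper: escape the plain text first, then wrap it in a tag,
// so stray '<', '>', '&' or quotes cannot be mistaken for markup later on.
function wrapPlainText(string $text, string $tag): string
{
    $safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    return "<{$tag}>{$safe}</{$tag}>";
}

echo wrapPlainText('bobs widgets: 2 < 3 & "cheap"', 'p');
```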
intuited
A: 

... just remember that eventually this approach will lead to sadness, and you'll need to start looking for a better approach. One way is to use 'tidy' to fix up your HTML into parseable XML; PHP then offers a few XML manipulation APIs to work with the data.

Here's an answer anyway.

You can add some wildcards instead of the word boundaries. Something like this should do the trick:

([^<>]*)(bobs widgets)([^<>]*)

Then, add some more replacement markers to keep the remainder of your text in the output:

'$1<a href="http://www.google.com"><strong>$2</strong></a>$3'
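Put together as one call, the pattern and replacement might look like the sketch below. The variable names and sample strings are made up. One caveat: the wildcard pattern on its own only keeps the match within a single run of text between tags and does not look at which tag encloses that run. The second pattern is therefore an assumed variation, not part of the suggestion above. It adds a negative lookahead that rejects a keyword when the next tag after it is a closing h1 or strong, which relies on the generated HTML staying as simple as described in the question.

```php
<?php
// The pattern and replacement from above, combined into one preg_replace call.
$pattern     = '%([^<>]*)(bobs widgets)([^<>]*)%i';
$replacement = '$1<a href="http://www.google.com"><strong>$2</strong></a>$3';

$paragraph = '<p>Buy bobs widgets today</p>';
$linked    = preg_replace($pattern, $replacement, $paragraph);

// Assumed variation (not from the answer): skip the keyword when the next
// tag that follows it is </h1> or </strong>.
$guardedPattern     = '%\b(bobs widgets)\b(?![^<]*</(?:h1|strong)>)%i';
$guardedReplacement = '<a href="http://www.google.com"><strong>$1</strong></a>';

$mixed         = '<h1>Here are bobs widgets for sale</h1><p>Buy bobs widgets today</p>';
$guardedResult = preg_replace($guardedPattern, $guardedReplacement, $mixed);

echo $linked, "\n", $guardedResult, "\n";
```

Since the replacement wraps the keyword in strong tags, a second pass with the guarded pattern should skip keywords that are already linked.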

Now hit save and hide behind the sofa ;)

amir75