ansaurus

Question

PHP preg_replace - Don't match within h1 tags

Answer 1

+2 A:

Regular expressions are the wrong tool for this job. This has been discussed many times on Stack Overflow (such as the most famous thread on the site).

What you need is an HTML parser, such as the Simple HTML DOM Parser. Do yourself a favour and use something like this from the start. Imagine what's going to happen when you run into an <h1> where someone has added an attribute, or perhaps someone has improperly closed the tags, so you have a mixed up order on a </strong> and a </h1>. Getting things like that to work with a regular expression is not worth the trouble, and sometimes isn't even possible.
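To give a feel for the parser route, here is a minimal sketch. It uses PHP's bundled DOM extension rather than Simple HTML DOM, purely so it runs without a third-party download. The sample markup, the phrase "bobs widgets" and the link URL are assumptions for illustration, and it only links the first occurrence in each text node.

```php
<?php
// Sample input (an assumption): generated markup containing only <h1> and <p> blocks.
$html   = '<h1>About bobs widgets</h1><p>We sell bobs widgets here.</p>';
$phrase = 'bobs widgets';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc->loadHTML($html);              // libxml wraps the fragment in <html><body>
libxml_clear_errors();

// Select text nodes that are not inside an <h1> and not already inside a link.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//text()[not(ancestor::h1) and not(ancestor::a)]');

foreach (iterator_to_array($nodes) as $text) {
    // Byte offset; fine for ASCII text like this sample.
    $offset = strpos($text->nodeValue, $phrase);
    if ($offset === false) {
        continue;
    }
    $match = $text->splitText($offset);   // $match now starts at the phrase
    $match->splitText(strlen($phrase));   // cut off whatever follows the phrase

    // Build <a><strong>phrase</strong></a> and swap it in for the text node.
    $a = $doc->createElement('a');
    $a->setAttribute('href', 'http://www.google.com');
    $strong = $doc->createElement('strong');
    $a->appendChild($strong);
    $match->parentNode->replaceChild($a, $match);
    $strong->appendChild($match);
}

// Serialise only the children of <body> to get the fragment back.
$result = '';
foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
    $result .= $doc->saveHTML($child);
}
echo $result, "\n";
```

The point is that the "not inside an h1" rule becomes a single XPath condition, and extra attributes on the <h1> make no difference to it.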

zombat 2010-03-29 00:52:13

Thanks for the reply zombat, I'll certainly look into the HTML parser you mentioned. I should have said that the HTML being processed isn't user-supplied: it comes from another part of the script, which takes plaintext and wraps each piece in <h1> or <p> tags depending on its length. Given that, do you think there is a regex solution?

James 2010-03-29 01:00:10

Should be okay as long as it's valid HTML, i.e. it doesn't have '<' or various other characters that HTML has special feelings for in inappropriate places (like the back of a Volkswagen). Once you learn to use the Simple HTML library (and this is not a large time investment), it's so much easier than dealing with regexes that there's no point in not using it, unless you're running this code on a pocket calculator.

intuited 2010-03-29 01:16:24

So in other words, you should escape any of those characters by running the text through htmlspecialchars(): http://ca3.php.net/manual/en/function.htmlspecialchars.php
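As a rough sketch of what that could look like in the wrapping step James describes: the function name wrapLine and the 40-character threshold are made up for illustration, since the actual script isn't shown.

```php
<?php
// Hypothetical helper: wraps one piece of plaintext in <h1> or <p>.
// The 40-character cut-off is an assumption, not taken from the original script.
function wrapLine(string $line, int $headingMaxLength = 40): string
{
    // Escape &, <, > and quotes so the text cannot be mistaken for markup.
    $safe = htmlspecialchars($line, ENT_QUOTES, 'UTF-8');
    $tag  = strlen($line) <= $headingMaxLength ? 'h1' : 'p';
    return "<$tag>$safe</$tag>";
}

echo wrapLine('Widgets < Gadgets & more'), "\n";
// prints: <h1>Widgets &lt; Gadgets &amp; more</h1>
```

With the text escaped like this, the only literal '<' and '>' characters left in the output are the ones belonging to the wrapper tags.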

intuited 2010-03-29 01:20:18

Answer 2

A:

... just remember that eventually this approach will lead to sadness, and you'll need to start looking for a better approach. One way is to use 'tidy' to fix up your HTML into parseable XML; PHP then offers a few XML manipulation APIs to work with the data.

Here's an answer anyway.

You can add some wildcards instead of the word boundaries. Something like this should do the trick:

([^<>]*)(bobs widgets)([^<>]*)

Then, add some more replacement markers to keep the remainder of your text in the output:

'$1<a href="http://www.google.com"><strong>$2</strong></a>$3'

Now hit save and hide behind the sofa ;)
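For reference, here is one way the pattern and replacement above might be put together in a preg_replace() call. The pattern on its own also matches the text inside an <h1>, so this sketch adds a step that is not part of the answer: it splits the markup with preg_split() and only touches the pieces outside <h1>...</h1>. The sample markup and variable names are assumptions, and it relies on the generated HTML being as simple as James describes.

```php
<?php
// Sample input (an assumption): generated markup with only <h1> and <p> wrappers.
$html = '<h1>About bobs widgets</h1><p>We sell bobs widgets here.</p>';

// Pattern and replacement as given in the answer.
$pattern     = '/([^<>]*)(bobs widgets)([^<>]*)/';
$replacement = '$1<a href="http://www.google.com"><strong>$2</strong></a>$3';

// Split on whole <h1>...</h1> blocks. Because the delimiter is captured,
// the <h1> blocks land at odd indices and everything else at even indices.
$parts = preg_split('~(<h1\b[^>]*>.*?</h1>)~is', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

foreach ($parts as $i => $part) {
    if ($i % 2 === 0) {
        // Outside any <h1>: apply the replacement.
        $parts[$i] = preg_replace($pattern, $replacement, $part);
    }
}

$result = implode('', $parts);
echo $result, "\n";
// prints: <h1>About bobs widgets</h1><p>We sell <a href="http://www.google.com"><strong>bobs widgets</strong></a> here.</p>
```

One quirk to be aware of: because `[^<>]*` is greedy, only the last occurrence of the phrase in each run of text gets linked.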

amir75 2010-03-29 01:06:49
