What I want to achieve with the code below: match a specified word case-insensitively, only once, in a text and replace it with a link.

I use the following preg_match call to find the word 'foo' in a string:

if (preg_match("/\bfoo\b/i", $text, $results, PREG_OFFSET_CAPTURE)) { 
  // use substr_replace() to swap the matched 'foo' for a link <a href.. 
}

This works for text without HTML, but consider the following text with HTML:

Lorem ipsum dolor sit amet, <a href="/foo-bar/" title="foo bar">some other foo link</a> consectetur adipiscing elit foo bar.

In this case a new link ends up inside the existing link, because the pattern also matches the foo in the href attribute (the same problem applies to the title and name attributes).

How can the pattern be changed to match only a 'foo' that is outside an HTML tag?
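To make the problem concrete, here is a small sketch that runs the same pattern on the sample text and checks where the first match lands. The variable names ($tagStart, $tagEnd, $insideTag) are only for illustration.

```php
<?php
$text = 'Lorem ipsum dolor sit amet, <a href="/foo-bar/" title="foo bar">some other foo link</a> consectetur adipiscing elit foo bar.';

if (preg_match('/\bfoo\b/i', $text, $results, PREG_OFFSET_CAPTURE)) {
    // $results[0] holds the matched text and its byte offset in $text.
    [$word, $offset] = $results[0];

    // Locate the opening <a ...> tag so the offset can be compared against it.
    $tagStart = strpos($text, '<a ');
    $tagEnd   = strpos($text, '>', $tagStart);

    // True when the first match falls between '<' and '>' of the opening tag.
    $insideTag = $offset > $tagStart && $offset < $tagEnd;
    var_dump($insideTag);
}
```

The first hit is the foo in href="/foo-bar/", so a substr_replace at that offset would corrupt the existing tag.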

+3  A: 

Don't parse HTML with regular expressions. Use XPath instead. PHP can easily make use of it.

The XPath expression for what you want is pretty straightforward. Assuming the tag that you want to search inside is a div, this is what you want:

//div/text()[contains(.,'foo')]

Once you have the text node, you can run a regular expression on it without worrying that it contains any HTML tags.
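A minimal sketch of that approach with PHP's DOM extension might look like the following. The helper name linkFirstWord, the wrapper id "wrap" and the target URL /foo/ are assumptions made for the example, not part of the question. It also differs from the expression above in two ways: contains() is case-sensitive, so the case-insensitive matching is left to the regular expression, and the XPath filter skips text nodes that already sit inside an a element.

```php
<?php
// Hypothetical helper: link the first standalone occurrence of $word that
// appears in a text node which is not already inside an <a> element.
function linkFirstWord(string $html, string $word, string $href): string
{
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $doc = new DOMDocument();
    // Wrap the fragment in a div so there is one known root to search in;
    // the XML prolog is a common trick to make libxml read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8" ?><div id="wrap">' . $html . '</div>');

    $xpath = new DOMXPath($doc);
    $wrap  = $xpath->query('//div[@id="wrap"]')->item(0);
    // All text nodes under the wrapper that have no <a> ancestor.
    $nodes = $xpath->query('.//text()[not(ancestor::a)]', $wrap);

    $pattern = '/\b' . preg_quote($word, '/') . '\b/i';
    foreach ($nodes as $node) {
        $value = $node->nodeValue;
        if (!preg_match($pattern, $value, $m, PREG_OFFSET_CAPTURE)) {
            continue;
        }
        [$found, $offset] = $m[0];

        // Build the new link around the text exactly as it was matched.
        $link = $doc->createElement('a');
        $link->setAttribute('href', $href);
        $link->appendChild($doc->createTextNode($found));

        // Replace the text node with: text before, link, text after.
        $parent = $node->parentNode;
        $parent->insertBefore($doc->createTextNode(substr($value, 0, $offset)), $node);
        $parent->insertBefore($link, $node);
        $parent->insertBefore($doc->createTextNode(substr($value, $offset + strlen($found))), $node);
        $parent->removeChild($node);
        break; // only the first occurrence is replaced
    }

    // Serialize the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($wrap->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$text = 'Lorem ipsum dolor sit amet, <a href="/foo-bar/" title="foo bar">some other foo link</a> consectetur adipiscing elit foo bar.';
$result = linkFirstWord($text, 'foo', '/foo/');
```

Because the match runs on text nodes only, attribute values such as href and title are never touched, and the existing link text is excluded by the ancestor check.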

Welbog
Good point, but in this case there will only be a few links with a fixed format and no other HTML in the text. So using XPath might be overkill.
Silverscreen
Well, using regular expressions is *impossible*. I'll take overkill over impossible any day of the week.
Welbog
A: 

You could count the number of opening and closing angle brackets encountered before the match. If the counts differ, a bracket has been opened but not yet closed, which means the match is inside an HTML tag.

However, note that in general, using regular expressions for HTML parsing is a terrible idea.
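The bracket-counting idea could be sketched roughly as follows. The function name firstMatchOutsideTags is made up for the example. Counting brackets alone only detects matches inside a tag, so this sketch also compares the number of opened and closed a elements to skip existing link text, which is an addition beyond the suggestion above. It assumes the fixed, simple link format mentioned in the comments and will give wrong answers if '<' or '>' appear in text or attribute values.

```php
<?php
// Hypothetical helper: byte offset of the first match of $word that is neither
// inside a <...> tag nor inside the text of an <a> element; -1 if none.
function firstMatchOutsideTags(string $text, string $word): int
{
    $pattern = '/\b' . preg_quote($word, '/') . '\b/i';
    if (!preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE)) {
        return -1;
    }
    foreach ($matches[0] as [$found, $offset]) {
        $prefix = strtolower(substr($text, 0, $offset));

        // More '<' than '>' before the match: a tag is still open here.
        if (substr_count($prefix, '<') !== substr_count($prefix, '>')) {
            continue;
        }
        // More '<a ' than '</a>' before the match: we are inside link text.
        if (substr_count($prefix, '<a ') !== substr_count($prefix, '</a>')) {
            continue;
        }
        return $offset;
    }
    return -1;
}

$text = 'Lorem ipsum dolor sit amet, <a href="/foo-bar/" title="foo bar">some other foo link</a> consectetur adipiscing elit foo bar.';
$offset = firstMatchOutsideTags($text, 'foo');
```

The returned offset can then be passed to substr_replace as in the question.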

John Feminella