I have looked at many questions here (and many more websites) and some provided hints but none gave me a definitive answer. I know regular expressions but I am far from being a guru. This particular question deals with regex in PHP.

I need to locate words in a text that are not surrounded by a hyperlink of a given class. For example, I might have

This <a href="blabblah" class="no_check">elephant</a> is green and this elephant is blue while this <a href="blahblah">elephant</a> is red.

I would need to match the second and third elephants but not the first (identified by the test class "no_check"). Note that there could be more attributes than just href and class within hyperlinks. I came up with

((?<!<a .*class="no_check".*>)\belephant\b)

which works beautifully in regex test software but not in PHP.

Any help is greatly appreciated. If you cannot provide a regular expression but can find some sort of PHP code logic that would circumvent the need for it, I would be equally grateful.

+1  A: 

If variable-width negative look-behind is not available, a quick and dirty solution is to reverse the string in memory, use a variable-width negative look-ahead instead, and then reverse the string again.
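
A minimal sketch of the reversal trick, assuming plain ASCII input (strrev() works on bytes, so multibyte text would need a different reversal) and assuming the class attribute is written exactly as class="no_check". The variable names and the sample string are only for illustration; the pattern is the question's look-behind rewritten back to front as a look-ahead.

```php
<?php
// Sample text from the question.
$text = 'This <a href="blabblah" class="no_check">elephant</a> is green and this elephant is blue while this <a href="blahblah">elephant</a> is red.';

// Reverse the text so that "what comes before the word" becomes
// "what comes after the word". Assumes single-byte characters.
$reversed = strrev($text);

// "elephant" reversed is "tnahpele". A reversed opening tag reads
// >..."kcehc_on"=ssalc... a<  so the variable-width look-behind can be
// written as a variable-width look-ahead, which PCRE accepts.
$pattern = '/\btnahpele\b(?!>[^<>]*"kcehc_on"=ssalc[^<>]* a<)/';

preg_match_all($pattern, $reversed, $matches, PREG_OFFSET_CAPTURE);

// Translate each offset in the reversed string back to the original string.
$offsets = [];
foreach ($matches[0] as $match) {
    $word      = $match[0];
    $revOffset = $match[1];
    $offsets[] = strlen($text) - $revOffset - strlen($word);
}
sort($offsets);

// $offsets now holds the start positions of the unprotected occurrences.
foreach ($offsets as $offset) {
    echo $offset, ': ', substr($text, $offset, 8), PHP_EOL;
}
```

Like the original look-behind, this only inspects the tag that directly precedes the word, so it would miss a word that sits deeper inside the link text.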

But you may be better off using an HTML parser.

Eric Strom
+1  A: 

I think the simplest approach would be to match either a complete <a> element whose class attribute is "no_check", or the word you're searching for. For example:

<a [^<>]*class="no_check"[^<>]*>.*?</a>|(\belephant\b)

If it was the word you matched, it will be in capture group #1; if not, that group should be empty or null.
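
Here is a rough sketch of how that pattern might be used from PHP with preg_match_all(). The sample string and variable names are illustrative. In PHP, a group that did not take part in the match is either left out of the match array or reported with offset -1 when PREG_OFFSET_CAPTURE is set, so the loop checks for both.

```php
<?php
// Sample text from the question.
$text = 'This <a href="blabblah" class="no_check">elephant</a> is green and this elephant is blue while this <a href="blahblah">elephant</a> is red.';

// Either consume a whole "no_check" link, or capture the bare word in group 1.
// "~" is used as the delimiter because the pattern contains "/".
$pattern = '~<a [^<>]*class="no_check"[^<>]*>.*?</a>|(\belephant\b)~';

preg_match_all($pattern, $text, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

$hits = [];
foreach ($matches as $m) {
    // Group 1 exists with a real offset only when the word branch matched.
    if (isset($m[1]) && $m[1][1] !== -1) {
        $hits[] = $m[1][1]; // byte offset of the word in $text
    }
}

foreach ($hits as $offset) {
    echo $offset, ': ', substr($text, $offset, 8), PHP_EOL;
}
```

The same alternation should also work with preg_replace_callback(), returning the whole match unchanged whenever group 1 is empty.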

Of course, by "simplest approach" I really meant the simplest regex approach. Even simpler would be to use an HTML parser.
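
For comparison, a parser-based version could look something like the sketch below. It assumes PHP's bundled DOM extension is available and that the input is an ASCII HTML fragment; the wrapping <div> and the XPath expression are choices made for this example, not something from the question.

```php
<?php
// Sample text from the question.
$html = 'This <a href="blabblah" class="no_check">elephant</a> is green and this elephant is blue while this <a href="blahblah">elephant</a> is red.';

$doc = new DOMDocument();
// Wrap the fragment in a <div> so the parser has a single container element.
$doc->loadHTML('<div>' . $html . '</div>');

$xpath = new DOMXPath($doc);

// Select every text node that has no <a> ancestor carrying the class
// "no_check". The concat/normalize-space idiom matches the class as a whole
// token even when the attribute lists several classes.
$query = '//text()[not(ancestor::a[contains(concat(" ", normalize-space(@class), " "), " no_check ")])]';

$count = 0;
foreach ($xpath->query($query) as $node) {
    // Count whole-word occurrences inside this text node.
    $count += preg_match_all('/\belephant\b/', $node->nodeValue);
}

echo $count, PHP_EOL;
```

Because the check is done on the parsed tree, attribute order, extra attributes, and nested markup inside the link no longer matter.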

Alan Moore
A: 

I ended up using a mixed solution. It turns out that I had to parse a text for specific keywords and check if they were already part of a link and if not add them to a hyperlink. The solutions provided here were very interesting but not exactly tailored enough for what I needed.

The idea of using an HTML parser was a good one though and I am currently using one in another project. So hats off to both Alan Moore and Eric Strom for suggesting that solution.

Technoh