I have the following regular expression to find words in text and highlight them.

I am using the word surface for testing purposes.

/((?<=[\W])surface?(?![\w]))|((?<![\w])surface?(?=[\W]))/iu

It matches all occurrences in the following text.

surface-CoP-20-70-0000-04-02_Pre-Run_Tool_Verification_Programming_and_surface_Tare surface_revC.pdf

But if I change the first occurrence of surface to contain an uppercase letter, it only matches the first occurrence.

Surface-CoP-20-70-0000-04-02_Pre-Run_Tool_Verification_Programming_and_surface_Tare surface_revC.pdf

Or if I put an uppercase letter in some of the other occurrences, it matches those as well.

Surface-CoP-20-70-0000-04-02_Pre-Run_Tool_Verification_Programming_and_Surface_Tare surface_revC.pdf

A: 

I have no idea what you're trying to achieve there, but possibly your problem is that \w will include _ (and \W will exclude it).

Maybe try this:

/(?<![a-z])surface(?![a-z])/iu

Or this:

/(?<=[\W_])surface(?=[\W_])/iu
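To see how the two suggestions differ, here is a minimal sketch in Python. Python's `re` module is only a stand-in for PHP's PCRE here, and `re.ASCII` is an assumption added so that `\w` means `[A-Za-z0-9_]` as described in this thread. The variable names are made up for illustration.

```python
import re

TEXT = ("Surface-CoP-20-70-0000-04-02_Pre-Run_Tool_Verification_"
        "Programming_and_surface_Tare surface_revC.pdf")

# re.ASCII keeps \w equal to [A-Za-z0-9_]; re.IGNORECASE plays the role of /i.
FLAGS = re.IGNORECASE | re.ASCII

# The underscore counts as a word character, so \W never matches it.
underscore_is_word_char = re.fullmatch(r"\w", "_", FLAGS) is not None

# Option 1: only reject a letter directly before or after the word.
letters_only = re.compile(r"(?<![a-z])surface(?![a-z])", FLAGS)

# Option 2: require a non-word character or an underscore on both sides.
delimited = re.compile(r"(?<=[\W_])surface(?=[\W_])", FLAGS)

option1_hits = [m.group(0) for m in letters_only.finditer(TEXT)]
option2_hits = [m.group(0) for m in delimited.finditer(TEXT)]

print(underscore_is_word_char)
print(option1_hits)
print(option2_hits)
```

Option 2 uses positive lookarounds, so it needs an actual character on each side. That is why it skips the occurrence at the very start of the text, while option 1 accepts it.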

Otherwise, please provide more details on what exactly you do/don't want to match.


Update: given this information:

surface2010 should not be matched

In that case, I suspect you want:

/(?<=\b|_)surface(?=\b|_)/iu

(Just \b would exclude the match in "...and_surface_Tare...", so we add the alternation with _ to include it.)
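A rough way to check this pattern, sketched in Python rather than PHP. Python's `re` module rejects a lookbehind whose alternatives differ in width, so this sketch assumes the equivalent form `(?:\b|(?<=_))` in place of `(?<=\b|_)`. The names `PATTERN` and `count_matches` are hypothetical.

```python
import re

# Equivalent rewrite of /(?<=\b|_)surface(?=\b|_)/i for Python's re module:
# either a word boundary, or an underscore directly next to the word.
PATTERN = re.compile(r"(?:\b|(?<=_))surface(?:\b|(?=_))",
                     re.IGNORECASE | re.ASCII)

def count_matches(text: str) -> int:
    """Return how many times the pattern matches in text."""
    return len(PATTERN.findall(text))

sample = ("Surface-CoP-20-70-0000-04-02_Pre-Run_Tool_Verification_"
          "Programming_and_surface_Tare surface_revC.pdf")

print(count_matches(sample))         # all three occurrences
print(count_matches("surface2010"))  # digit after the word, so no match
print(count_matches("surface"))      # the word on its own still matches
```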

Peter Boughton
I want to match whole words in text, not surf in surface etc. Words followed or preceded by a space or any other non-word character. I use this to highlight these words in the text. It should be case-insensitive; that's when the problem occurs. It also has to match the word by itself, when there are no other words or characters, which it doesn't.
oddi
Define "non-word character". In regex, a word character (`\w`) is `[A-Za-z0-9_]`, which might not be what you want - hence the two options I posted above. The first of these (or a slight adaptation) should give you what you want. (The `i` flag makes it case-insensitive, and it's unlikely that PHP has a bug in that.)
Peter Boughton
A: 

Am I missing something?

/\bsurface\b/i
strager
This will not match `_surface_` because `\b` is a change between `\w` and `\W` and the `_` character is included in `\w`.
Peter Boughton
@Peter Boughton, Then do something like: `/(?<=_|\b)surface(?=_|\b)/i`
strager
Yeah, which is similar to my `[\W_]` one, although that won't match just "surface" - but the first one I listed is probably preferred anyhow. Need clarification from the OP on whether "surface2010" should be matched or not.
Peter Boughton
surface2010 should not be matched
oddi
A: 

So you want to match surface case-insensitively unless it's preceded or followed immediately by a letter or digit? Try this:

/(?<![A-Za-z0-9])surface(?![A-Za-z0-9])/i

I left off the /u modifier (which causes the regex and the subject string to be treated as UTF-8) because you appear to be dealing with pure ASCII text. \w, \W and \b are not affected by /u anyway.
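Since the goal is highlighting, here is a minimal sketch of how this pattern might be applied. It uses Python instead of PHP, and both the `highlight` function and the `<mark>` wrapper are assumptions for illustration, not something from the question.

```python
import re

# Same pattern as above; re.IGNORECASE stands in for the /i modifier.
PATTERN = re.compile(r"(?<![A-Za-z0-9])surface(?![A-Za-z0-9])", re.IGNORECASE)

def highlight(text: str) -> str:
    """Wrap each match in <mark> tags, keeping the original casing."""
    return PATTERN.sub(lambda m: f"<mark>{m.group(0)}</mark>", text)

print(highlight("Surface-CoP and_surface_Tare"))
print(highlight("surface2010"))
```

Using a callback for the replacement keeps whatever casing appeared in the text, which a fixed replacement string would overwrite.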

Alan Moore