I have been looking through SO and although this question has been answered in one scenario:

Regex to match all words except a given list

It's not quite what I'm looking for. I am trying to write a regular expression which matches any string of the form [\w]+[(], but which doesn't match the three strings "cat(", "dog(" and "sheep(" specifically.

I have been playing with lookahead and lookbehind, but I can't quite get there. I may be overcomplicating this, so any help would be greatly appreciated.

+3  A: 

If the regular expression implementation supports look-ahead or look-behind assertions, you could use the following:

  • Using a negative look-ahead assertion:

     \b(?!(?:cat|dog|sheep)\()\w+\(
    
  • Using a negative look-behind assertion:

     \b\w+\((?<!\b(?:cat|dog|sheep)\()
    

I added the \b anchor that marks a word boundary. So catdog( would be matched although it contains dog(.
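
As a quick check of the look-ahead version, here is a minimal Python sketch using the standard `re` module. The pattern is the one given above; the sample string is made up for illustration.

```python
import re

# Look-ahead pattern from the answer above.
pattern = re.compile(r'\b(?!(?:cat|dog|sheep)\()\w+\(')

# Hypothetical sample input, not taken from the question.
sample = "cat( dog( sheep( catdog( something( catastrophe("

# No capturing groups, so findall returns the full matches.
matches = pattern.findall(sample)
print(matches)
```

Only the three exact forbidden tokens should be skipped; longer words that merely start with or contain one of them still match.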

Look-ahead assertions are more widely supported by regex implementations, but the regex with the look-behind assertion is more efficient, since the assertion is only tested once the preceding regex (in our case \b\w+\() has already matched. The look-ahead assertion, by contrast, is tested before the actual regex matches: in our case, whenever \b matches.

Gumbo
The second one is most likely more efficient, since it doesn't check every single position with a negative look-ahead (it's worth noting that they're negative). Also, I'm thinking it might be better to put the negative look-behind after the parenthesis and include a parenthesis in the look-behind. This way, it will only perform an extra look-behind once it finds a possible match, rather than for every word in the string.
Blixt
@Blixt: Good point.
Gumbo
Also, your first regex will reject `catastrophe(`, `dogmatic(` and `sheepily(`. Your second one is saved from a similar error by the `\b` in the look-behind.
rampion
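
The difference described in this comment can be seen in a short Python sketch. It assumes the earlier form of the look-ahead had no `\(` inside it, as in the grep attempt quoted elsewhere in these comments; the test words are the ones named in the comment, plus `cat(`.

```python
import re

# Assumed earlier form: the look-ahead checks only the word prefix.
prefix_only = re.compile(r'\b(?!(?:cat|dog|sheep))\w+\(')
# Form with the parenthesis inside the look-ahead.
with_paren = re.compile(r'\b(?!(?:cat|dog|sheep)\()\w+\(')

words = ["catastrophe(", "dogmatic(", "sheepily(", "cat("]

for w in words:
    # Print whether each pattern finds a match in the word.
    print(w, bool(prefix_only.search(w)), bool(with_paren.search(w)))
```

The prefix-only form rejects all four words, while the form with `\(` in the look-ahead rejects only `cat(`.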
Right, I had a go with > grep '\b(?!(?:cat|dog|sheep))\w+[(]' text.txt text.txt cat() dog() catdog() something() and it's not returning anything. I also had a go in TextMate with the regular expression search, but nada. I can see the logic behind the first statement though; perhaps this is a compatibility issue? I thought look-ahead was pretty standard. I've certainly been using it today in some form or another.
Huguenot
ah, no carriage returns, the second text.txt indicates the file's contents.
Huguenot
grep uses POSIX regular expressions, not PCRE, which are slightly different. I don't think the POSIX standard includes lookbehinds or lookaheads.
rampion
Thank you, it was an unfortunate coincidence that both methods weren't working properly. TextMate had copied over a carriage return into the regular expression search box. Thanks very much for the help, everyone.
Huguenot
I have modified the first one to "\b(?!cat\(|dog\(|sheep\()\w+\(" to prevent the problem mentioned by rampion, and that seems to work. For some reason, it didn't like the word boundary in the second expression, so I changed it to a \W, à la \b[A-Za-z]+\((?<!\W(?:cat\(|dog\(|rat\()), and that seems to have done the trick. Note that I had to change the third term to the same size as the others; fortunately this is not a problem.
Huguenot
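
The need for same-size terms comes from engines that only accept fixed-width look-behind. Python's `re` module is one such engine. A possible workaround, sketched here as an assumption rather than something proposed in this thread, is to chain one fixed-width look-behind per forbidden word, so that `sheep` can stay in the list.

```python
import re

# One fixed-width negative look-behind per forbidden word.
# \b is zero-width, so each look-behind has a constant width.
pattern = re.compile(r'\b\w+\((?<!\bcat\()(?<!\bdog\()(?<!\bsheep\()')

# Hypothetical sample input.
sample = "cat( dog( sheep( catdog( something("

matches = pattern.findall(sample)
print(matches)
```

Each look-behind is only tried after `\b\w+\(` has matched, which is the behaviour the answer describes for the look-behind variant.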
The first regex just needed grouping parens around the alternation: `\b(?!(?:cat|dog|sheep)\()\w+\(`
Alan Moore
@Alan M: You’re right, thanks.
Gumbo
A: 

Do you really require this in a single regex? If not, then the simplest implementation is just two regexes - one to check you don't match one of your forbidden words, and one to match your \w+, chained with a logical AND.

ire_and_curses
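
A minimal Python sketch of the two-regex idea: one pattern checks that the token has the `\w+(` form, a second checks that it is not one of the forbidden words, and the two results are combined with `and`. The helper name `is_allowed` is hypothetical.

```python
import re

# First regex: the general shape, one or more word characters then "(".
candidate = re.compile(r'\w+\(')
# Second regex: the exact tokens that must be rejected.
forbidden = re.compile(r'(?:cat|dog|sheep)\(')

def is_allowed(token):
    # Logical AND of the two checks: right shape and not forbidden.
    return bool(candidate.fullmatch(token)) and not forbidden.fullmatch(token)

for token in ["cat(", "catdog(", "something(", "sheep("]:
    print(token, is_allowed(token))
```

Using `fullmatch` on an already isolated token keeps both patterns simple and avoids look-around entirely, at the cost of having to split out the tokens first.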