I'm attempting something that I feel should be fairly obvious, but it isn't. I'm trying to match a string which does NOT contain a specific sequence of characters. I've tried using [^ab], [^(ab)], etc. to match strings containing no 'a's or 'b's, or only 'a's, or only 'b's, or 'ba', but not match on 'ab'. It's true that the patterns I tried won't match 'ab', but they also won't match 'a' alone, and I need them to. Is there some simple way to do this?

+2  A: 

Using a regex as you described is the simple way (as far as I am aware). If you want a range you could use [^a-f].

Copas
+12  A: 

Use negative lookahead:

^(?!.*ab).*$
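
As a quick sanity check, here is a minimal sketch that runs this pattern against the inputs mentioned in the question. Python's `re` module is used purely for illustration (the answer itself is language-neutral), and the names `NO_AB` and `has_no_ab` are made up for this example.

```python
import re

# Pattern from the answer: the lookahead at the start of the string
# rejects the match if "ab" occurs anywhere further along.
NO_AB = re.compile(r"^(?!.*ab).*$")


def has_no_ab(text: str) -> bool:
    """Return True when the pattern accepts the string."""
    return NO_AB.match(text) is not None


if __name__ == "__main__":
    for sample in ["a", "b", "ba", "xyz", "", "ab", "cab", "abc"]:
        print(repr(sample), has_no_ab(sample))
```

Note that `.` does not match a newline by default, so multi-line input would need the equivalent of a "dot matches all" flag.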
Alan Moore
I believe this is more efficient: (?:(?!ab).)*
Blixt
Also wants to use start/end markers to enforce the check on the whole string.
Peter Boughton
@Blixt: yes, it is. But it's also harder to read, especially for regex newbies. The one I posted will be efficient enough for most applications.
Alan Moore
@Peter: I was fixing that as you posted the comment. Anchors aren't necessary in all cases (e.g., when using Java's matches() method), but they don't hurt anything either.
Alan Moore
Don't write code aimed at newbies! If code is hard to read, leave comments/documentation so they can learn, instead of using lesser code that keeps them ignorant.
Peter Boughton
@Peter: I second that!
Paulo Santos
If I had thought there would be a noticeable difference between the two approaches, I wouldn't have hesitated to recommend the faster one. On the other hand, regexes are so opaque (if not cryptic), I think it's worthwhile to break the knowledge into smaller, more manageable chunks whenever possible.
Alan Moore
+11  A: 

Using a character class such as [^ab] will match a single character that is not in the set of listed characters (the leading ^ is the negating part).

To match a string which does not contain the multi-character sequence ab, you want to use a negative lookahead:

^(?:(?!ab).)+$


And the above expression dissected in regex comment mode is:

(?x)    # enable regex comment mode
^       # match start of line/string
(?:     # begin non-capturing group
  (?!   # begin negative lookahead
    ab  # literal text sequence ab
  )     # end negative lookahead
  .     # any single character
)       # end non-capturing group
+       # repeat previous match one or more times
$       # match end of line/string
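
For anyone who wants to try the commented form directly, the following is a minimal sketch in Python, assuming the `re.VERBOSE` flag as the equivalent of comment mode. The names `NO_AB_VERBOSE` and `lacks_ab` are hypothetical helpers, not part of the answer.

```python
import re

# Same expression as above, written with Python's verbose flag so that
# whitespace and "#" comments inside the pattern are ignored.
NO_AB_VERBOSE = re.compile(
    r"""
    ^           # start of string
    (?:         # begin non-capturing group
      (?!ab)    # the next two characters must not be "ab"
      .         # consume one character
    )+          # one or more times
    $           # end of string
    """,
    re.VERBOSE,
)


def lacks_ab(text: str) -> bool:
    """Return True when the string is non-empty and never contains "ab"."""
    return NO_AB_VERBOSE.match(text) is not None


if __name__ == "__main__":
    for sample in ["a", "ba", "ab", "xaby", ""]:
        print(repr(sample), lacks_ab(sample))
```

Because the group is repeated with `+`, the empty string is rejected; swapping `+` for `*` would accept it, which may or may not be what is wanted.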
Peter Boughton
Thanks Peter, this is a great explanation and it works!
Stuart
A: 

The regex [^(ab)] is a character class: it matches any single character other than '(', 'a', 'b' or ')'. So it will find a match in 'ab ab ab ab' (the space), but not in 'ab'.
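
A small sketch of that behaviour, using Python's `re` module as an assumed test bed (the name `NEGATED_CLASS` is illustrative only). Inside square brackets the parentheses are literal characters, not a group.

```python
import re

# A negated character class: any ONE character except "(", "a", "b", ")".
NEGATED_CLASS = re.compile(r"[^(ab)]")

if __name__ == "__main__":
    print(NEGATED_CLASS.search("ab ab ab ab"))  # finds the first space
    print(NEGATED_CLASS.search("ab"))           # None
    print(NEGATED_CLASS.search("a"))            # None, which is the asker's problem
    print(NEGATED_CLASS.search("a(b)"))         # None, parentheses are excluded too
```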

What language/scenario do you have? Can you subtract results from the original set, and just match ab?

If you are using GNU grep and are parsing input, use the '-v' flag to invert your results, returning all non-matching lines. Other regex tools have a similar 'return non-matches' option.

If I understand correctly, you want everything except for those items which contain 'ab' anywhere.
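
Where no inverted-match option is available, the same effect can be sketched by filtering in the host language. The Python below is only an illustration of the idea; `lines_without` is a hypothetical helper, not a feature of grep or any regex library.

```python
import re
from typing import Iterable, List


def lines_without(pattern: str, lines: Iterable[str]) -> List[str]:
    """Keep only the lines in which the pattern does NOT occur."""
    rx = re.compile(pattern)
    return [line for line in lines if rx.search(line) is None]


if __name__ == "__main__":
    print(lines_without("ab", ["a", "ab", "cab", "ba"]))
```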

maxwellb
A: 

Simplest way is to pull the negation out of the regular expression entirely:

if (!userName.matches("^([Ss]ys)?admin$")) { ... }

While this is useful if you are consuming *just* that expression, as part of a larger expression the negative lookahead method described by Peter allows both positive and negative conditions in a single string.
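
For comparison, a rough Python counterpart of the Java line above might look like this. It assumes `re.fullmatch` as the analogue of Java's whole-string `matches()`; the names `ADMIN` and `is_regular_user` are invented for the example.

```python
import re

# Positive pattern for the names we want to exclude.
ADMIN = re.compile(r"([Ss]ys)?admin")


def is_regular_user(user_name: str) -> bool:
    """Negate in code instead of inside the regex."""
    return ADMIN.fullmatch(user_name) is None


if __name__ == "__main__":
    for name in ["alice", "admin", "Sysadmin", "sysadmin", "administrator"]:
        print(name, is_regular_user(name))
```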
Godeke
Absolutely true. But the question was to "match a string which does NOT contain a specific sequence of characters". I think for that purpose negative lookahead is overkill.
+3  A: 

Yes, it's called negative lookahead. It goes like this: (?!regex here). So abc(?!def) will match abc not followed by def. So it'll match abce, abc, abck, etc.

Similarly there is positive lookahead - (?=regex here). So abc(?=def) will match abc followed by def.

There are also negative and positive lookbehind - (?<!regex here) and (?<=regex here) respectively

One point to note is that the negative lookahead is zero-width. That is, it does not count as having taken any space.

So it may look like a(?=b)c will match "abc" but it won't. It will match 'a', then the positive lookahead with 'b' but it won't move forward into the string. Then it will try to match the 'c' with 'b' which won't work. Similarly ^a(?=b)b$ will match 'ab' and not 'abb' because the lookarounds are zero-width (in most regex implementations).
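
The zero-width behaviour is easy to confirm with a few calls. This sketch assumes Python's `re` module, whose lookarounds behave as described here; `check` is just a throwaway helper.

```python
import re


def check(pattern: str, text: str) -> bool:
    """True when the whole string matches the pattern."""
    return re.fullmatch(pattern, text) is not None


if __name__ == "__main__":
    print(check(r"a(?=b)c", "abc"))   # False: the lookahead consumed nothing
    print(check(r"a(?=b)b", "ab"))    # True
    print(check(r"a(?=b)b", "abb"))   # False
    # The lookahead text is not part of the reported match.
    print(re.search(r"a(?=b)", "ab").group())               # a
    # The first "abc" is followed by "def", so the match starts at index 7.
    print(re.search(r"abc(?!def)", "abcdef abck").start())  # 7
```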

More information on this page

abhinavg
+1  A: 

In this case I might just simply avoid regular expressions altogether and go with something like:

if (StringToTest.IndexOf("ab") < 0)
  //do stuff

This is likely also going to be much faster (a quick test vs regexes above showed this method to take about 25% of the time of the regex method). In general, if I know the exact string I'm looking for, I've found regexes are overkill. Since you know you don't want "ab", it's a simple matter to test if the string contains that string, without using regex.
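
The same idea in Python, as a sketch only: `str.find` returns -1 when the substring is absent, much like `IndexOf`, and the `in` operator is the more idiomatic spelling. The function names are illustrative, and no timing claim is implied by this snippet.

```python
def lacks_ab_find(text: str) -> bool:
    """Mirror of the IndexOf check: find() returns -1 when "ab" is absent."""
    return text.find("ab") < 0


def lacks_ab_in(text: str) -> bool:
    """Equivalent check using the membership operator."""
    return "ab" not in text


if __name__ == "__main__":
    for sample in ["a", "ba", "ab", "cab", ""]:
        print(repr(sample), lacks_ab_find(sample), lacks_ab_in(sample))
```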

patjbs
This is a good point! If the sequence is a simple string then a regex is over-complicating things; a contains/indexOf check is the more sensible option.
Peter Boughton
A: 

I have a similar problem: I'm trying to match a string which does NOT contain any sequence from a group of sequences, e.g. "ab", "cd" or "ef". Can you help me with the syntax, please?
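
One possible approach, sketched here under the assumption that the lookahead technique from the earlier answers is acceptable, is to put an alternation inside the negative lookahead: ^(?!.*(?:ab|cd|ef)).*$. The Python helper `build_excluder` below is hypothetical and simply assembles that pattern from a list of sequences.

```python
import re


def build_excluder(*sequences: str) -> "re.Pattern[str]":
    """Build a pattern that rejects strings containing any given sequence."""
    # re.escape keeps characters such as "." or "*" literal.
    alternation = "|".join(re.escape(seq) for seq in sequences)
    return re.compile(rf"^(?!.*(?:{alternation})).*$")


NO_PAIRS = build_excluder("ab", "cd", "ef")

if __name__ == "__main__":
    for sample in ["xyz", "ba dc fe", "xcdy", "ef", "ab"]:
        print(repr(sample), NO_PAIRS.match(sample) is not None)
```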

+1  A: 

abc(?!def) will match abc not followed by def. So it'll match abce, abc, abck, etc.

What if I want it to be followed by neither def nor xyz?

Will it be abc(?!(def)(xyz))?
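
A quick experiment suggests the answer is no: two groups written back to back form a sequence, so that lookahead only rejects abc followed by defxyz. Alternation with | is the usual way to say "either", giving abc(?!def|xyz). The Python sketch below compares the two; the names `PROPOSED` and `ALTERNATION` are made up for the comparison.

```python
import re

# The pattern from the question: rejects only "abc" followed by "defxyz".
PROPOSED = re.compile(r"abc(?!(def)(xyz))")

# Alternation inside the lookahead: rejects "abc" followed by "def" OR "xyz".
ALTERNATION = re.compile(r"abc(?!def|xyz)")

if __name__ == "__main__":
    for sample in ["abcdef", "abcxyz", "abcdefxyz", "abce", "abc"]:
        print(
            sample,
            PROPOSED.search(sample) is not None,
            ALTERNATION.search(sample) is not None,
        )
```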

nishant