tags:

views:

676

answers:

10

Several times recently I've wanted to create a regular expression that will never match any input (even an empty string). Some of the apparently most obvious ways are either very crude, turn out to have undesirable behaviour... or just plain don't work!

My provisional solution, which works fine for my particular case is simply a pattern of '\x00NEVERMATCHES\x00' (Python syntax for a string bracketed by two NUL bytes). It's self-documenting and, given the binary content, can never match my text-only input. What about the case where the input has no such restrictions?

A reading of the Javascript and Python regular expression docs didn't bring anything simple to mind. What would be your approach to this?

Solutions in any fairly generic regex syntax would be fine. My particular uses would be in Javascript and Python, but their regex syntaxes are pretty typical so I won't limit the answers unnecessarily at this point.

+1  A: 

Maybe this?

/$.+^/
Dan Breen
In Python, this approach works only if you control the _flags_: `re.compile('$.+^', re.MULTILINE|re.DOTALL).search('a\nb\nc\n')` returns a match object corresponding to the b and c (and all adjacent and in-between newlines). The negative-lookahead approach I recommend works (i.e., fails to match anything) for any combination of flags it could be compiled with.
Alex Martelli
My bad - mixed up the `$` and `^`.
Chris Lutz
This may be an attempt to look for the end of a string *before* the beginning, but I've found that the $ doesn't mean 'end of string' unless it's the last character of the regex, and I expect a similar behaviour applies to ^, so this might match a substring starting with a literal $, and ending with a literal ^
pavium
@pavium, it certainly doesn't behave that way in Python or Javascript. Unless you escape them with \ or include them in a character set with [], special characters like $ and ^ should not be treated as literals. In what language did you observe this?
Peter Hansen
A: 

What about instead of regex, just use an always false if statement? In javascript:

var willAlwaysFalse=false;
if(willAlwaysFalse)
{
}
else
{
}
Ngu Soon Hui
I added a comment in reply to Charlie's question, explaining why this sort of approach isn't desirable. In short, I need a group inside a regex that will always be used, but in some cases the group must be built to ensure it can never match.
Peter Hansen
+19  A: 

Leverage negative lookahead:

>>> import re
>>> x=r'(?!x)x'
>>> r=re.compile(x)
>>> r.match('')
>>> r.match('x')
>>> r.match('y')

this RE is a contradiction in terms and therefore will never match anything.

Alex Martelli
Well done, sir! Yours is a lot less hacky than mine. +1
Chris Lutz
Thanks @Chris! I actually once, in real life, had to help somebody (who fancied themselves as a leet regex hacker) who had gotten their lookaheads in a twist and ended up with a more complicated version of such a "contradiction in terms" regex...;-).
Alex Martelli
Minor variations, for completeness: `r'(?=x)[^x]'` `r'x(?<!x)'` `r'[^x](?<=x)'`
Chris Lutz
@Chris, yep -- also, `(?=x)(?!x)` and so on (concatenations of contradictory lookaheads, and same for lookbehinds), and many of those also work for arbitrary values of `x` (lookbehinds need `x`s that match strings of fixed-length).
Alex Martelli
Appears to work well. But what about just (?!) instead? Since () will always match, wouldn't (?!) be guaranteed never to match?
Peter Hansen
@Peter, yes, if Python accepts that syntax (and recent releases appear to), then it would be self-contradictory as well. Another idea (not quite as elegant, but the more ideas you get the likelier you are to find one working across all RE engines of interest): `r'a\bc'`, looking for a word-boundary immediately surrounded by letters on both sides (variant: nonword characters on both sides).
Alex Martelli
(?!) does seem to work with Python, but not with Javascript (in Firefox 3.5). Modifying it to (?!()) works for both. I'm not sure I'd want to rely on it in JS though. Performance could also be a consideration... some of these may be relatively slow, if they test at every character, while others may short-circuit the test.
Peter Hansen
Interestingly, my original with a simple literal that I "know" won't appear in my input turns out to be fastest, in Python. With a 5MB input string, and using this in a sub() operation, (?!x)x takes 21% longer, (?!()) is 16%, and ($^) 6% longer. May be significant in some cases, though not in mine.
Peter Hansen
+1  A: 

Python won't accept it, but Perl will:

perl -ne 'print if /(w\1w)/'

This regex should (theoretically) try to match an infinite (even) number of ws, because the first group (the ()s) recurses into itself. Perl doesn't seem to be issuing any warnings, even under use strict; use warnings;, so I assume it's at least valid, and my (minimal) testing fails to match anything, so I submit it for your critique.

Chris Lutz
Theory is always nice, but in practice I think I'd be worried about regular expressions whose descriptions included the word "infinite"!
Peter Hansen
+6  A: 

$.

.^

$.^

(?!)

Knio
Cute! My subconscious steered me away from ideas like the first three, as they're "illegal"... conceptually, but obviously not to the regex. I don't recognize the (!) one... will have to look that one up.
Peter Hansen
The latter one must either be a typo, or maybe you were simply expressing parenthical excitement :-). Or did you mean (?!) instead, which is similar to Alex's answer.
Peter Hansen
oops - yes I meant `(?!)`
Knio
Okay then, I like the (?!) answer... effectively what Alex suggested. Note that in http://stackoverflow.com/questions/1723182 (pointed out by Amarghosh above) someone claims "some flavours" of regex would consider that a syntax error. Python likes it fine though. Note that your other suggestions would all fail with re.DOTALL|re.MULTILINE modes in Python.
Peter Hansen
Has this been tested? I would have assumed that `^` only has special meaning as the first character of a regexp, and `$` only has special meaning at the end of a regexp, unless the regular expression is a multi-line expression.
PP
+2  A: 

The fastest will be:

r = re.compile(r'a^')
r.match('whatever')

'a' can be any non-special character ('x','y'). Knio's implementation might be a bit more pure but this one will be faster for all strings not starting with whatever character you choose instead of 'a' because it will not match after the first character rather than after the second in those cases.

Adam Nelson
Indeed, (.^) would be roughly 10% slower than (\x00^) in my case.
Peter Hansen
I'm accepting this, since using any value other than \n as the character is guaranteed never to match, and I see it as slightly more readable (given that relatively few people are regex experts) than the (?!x)x option, though I voted that one up too. In my case, for either option I would need a comment to explain it, so I think I'll just adjust my original attempt to '\x00NEVERMATCHES^'. I get the no-match guarantee of this answer, with my original self-documenting-ness. Thanks to all for answers!
Peter Hansen
Does this actually work and if so, who decided to break with Unix? In Unix regexps, `^` is special only as the first character and similarly with `$`. With any Unix tool, that regexp is going to match anything containing the literal string `a^`.
jk
Heh, that's a good attack. I never tested against that literal string.
Adam Nelson
+2  A: 

Perl 5.10 supports special control words called "verbs", which is enclosed in (*...) sequence. (Compare with (?...) special sequence.) Among them, it includes (*FAIL) verb which returns from the regular expression immediately.

Note that verbs are also implemented in PCRE shortly after, so you can use them in PHP or other languages using PCRE library too. (You cannot in Python or Ruby, however. They use their own engine.)

Kang Seonghoon
The docs for that at http://perldoc.perl.org/perlre.html#%28%2AFAIL%29-%28%2AF%29 say "This pattern matches nothing and always fails. It is equivalent to (?!), but easier to read. In fact, (?!) gets optimised into (*FAIL) internally." Interesting, as (?!) is my favourite "pure" answer so far (even though it doesn't work in Javascript). Thanks.
Peter Hansen
A: 

I believe that

\Z RE FAILS! \A

covers even the cases where the regular expression includes flags like MULTILINE, DOTALL etc.

>>> import re
>>> x=re.compile(r"\Z RE FAILS! \A")
>>> x.match('')
>>> x.match(' RE FAILS! ')
>>>

I believe (but I haven't benchmarked it) that whatever the length (> 0) of the string between \Z and \A, the time-to-failure should be constant.

ΤΖΩΤΖΙΟΥ
+1  A: 

[^\d\D] or (?=a)b or a$a or a^a

Bart Kiers
Thanks. Note that (?!x)x was the first answer given, listed above.
Peter Hansen
Yes, it seemed I scanned the other answerers too quickly.
Bart Kiers
+3  A: 

One that was missed:

^\b$

It can't match because the empty string doesn't contain a word boundary. Tested in Python 2.5.

Mark Byers