ansaurus

Question

What regular expression can never match?

Answer 1

+1 A:

Maybe this?

/$.+^/

Dan Breen 2009-12-04 05:46:04

In Python, this approach works only if you control the _flags_: `re.compile('$.+^', re.MULTILINE|re.DOTALL).search('a\nb\nc\n')` returns a match object corresponding to the b and c (and all adjacent and in-between newlines). The negative-lookahead approach I recommend works (i.e., fails to match anything) for any combination of flags it could be compiled with.

Alex Martelli 2009-12-04 05:52:12

My bad - mixed up the `$` and `^`.

Chris Lutz 2009-12-04 05:52:40

This may be an attempt to look for the end of a string *before* the beginning, but I've found that the $ doesn't mean 'end of string' unless it's the last character of the regex, and I expect a similar behaviour applies to ^, so this might match a substring starting with a literal $, and ending with a literal ^

pavium 2009-12-04 05:56:44

@pavium, it certainly doesn't behave that way in Python or Javascript. Unless you escape them with \ or include them in a character set with [], special characters like $ and ^ should not be treated as literals. In what language did you observe this?

Peter Hansen 2009-12-04 21:17:20

Answer 2

A:

What about instead of regex, just use an always false if statement? In javascript:

var willAlwaysFalse=false;
if(willAlwaysFalse)
{
}
else
{
}

Ngu Soon Hui 2009-12-04 05:46:13

I added a comment in reply to Charlie's question, explaining why this sort of approach isn't desirable. In short, I need a group inside a regex that will always be used, but in some cases the group must be built to ensure it can never match.

Peter Hansen 2009-12-04 06:12:32

Answer 3

+19 A:

Leverage negative lookahead:

>>> import re
>>> x=r'(?!x)x'
>>> r=re.compile(x)
>>> r.match('')
>>> r.match('x')
>>> r.match('y')

this RE is a contradiction in terms and therefore will never match anything.

Alex Martelli 2009-12-04 05:46:36

Well done, sir! Yours is a lot less hacky than mine. +1

Chris Lutz 2009-12-04 05:48:55

Thanks @Chris! I actually once, in real life, had to help somebody (who fancied themselves as a leet regex hacker) who had gotten their lookaheads in a twist and ended up with a more complicated version of such a "contradiction in terms" regex...;-).

Alex Martelli 2009-12-04 05:54:56

Minor variations, for completeness: `r'(?=x)[^x]'` `r'x(?<!x)'` `r'[^x](?<=x)'`

Chris Lutz 2009-12-04 06:01:38

@Chris, yep -- also, `(?=x)(?!x)` and so on (concatenations of contradictory lookaheads, and same for lookbehinds), and many of those also work for arbitrary values of `x` (lookbehinds need `x`s that match strings of fixed-length).

Alex Martelli 2009-12-04 06:18:03

Appears to work well. But what about just (?!) instead? Since () will always match, wouldn't (?!) be guaranteed never to match?

Peter Hansen 2009-12-04 06:31:00

@Peter, yes, if Python accepts that syntax (and recent releases appear to), then it would be self-contradictory as well. Another idea (not quite as elegant, but the more ideas you get the likelier you are to find one working across all RE engines of interest): `r'a\bc'`, looking for a word-boundary immediately surrounded by letters on both sides (variant: nonword characters on both sides).

Alex Martelli 2009-12-04 15:17:52

(?!) does seem to work with Python, but not with Javascript (in Firefox 3.5). Modifying it to (?!()) works for both. I'm not sure I'd want to rely on it in JS though. Performance could also be a consideration... some of these may be relatively slow, if they test at every character, while others may short-circuit the test.

Peter Hansen 2009-12-04 20:24:07

Interestingly, my original with a simple literal that I "know" won't appear in my input turns out to be fastest, in Python. With a 5MB input string, and using this in a sub() operation, (?!x)x takes 21% longer, (?!()) is 16%, and ($^) 6% longer. May be significant in some cases, though not in mine.

Peter Hansen 2009-12-04 21:23:51

Answer 4

+1 A:

Python won't accept it, but Perl will:

perl -ne 'print if /(w\1w)/'

This regex should (theoretically) try to match an infinite (even) number of ws, because the first group (the ()s) recurses into itself. Perl doesn't seem to be issuing any warnings, even under use strict; use warnings;, so I assume it's at least valid, and my (minimal) testing fails to match anything, so I submit it for your critique.

Chris Lutz 2009-12-04 05:48:13

Theory is always nice, but in practice I think I'd be worried about regular expressions whose descriptions included the word "infinite"!

Peter Hansen 2009-12-04 20:37:14

Answer 5

+6 A:

$.

.^

$.^

(?!)

Knio 2009-12-04 05:52:57

Cute! My subconscious steered me away from ideas like the first three, as they're "illegal"... conceptually, but obviously not to the regex. I don't recognize the (!) one... will have to look that one up.

Peter Hansen 2009-12-04 06:11:30

The latter one must either be a typo, or maybe you were simply expressing parenthical excitement :-). Or did you mean (?!) instead, which is similar to Alex's answer.

Peter Hansen 2009-12-04 06:33:11

oops - yes I meant `(?!)`

Knio 2009-12-04 06:48:46

Okay then, I like the (?!) answer... effectively what Alex suggested. Note that in http://stackoverflow.com/questions/1723182 (pointed out by Amarghosh above) someone claims "some flavours" of regex would consider that a syntax error. Python likes it fine though. Note that your other suggestions would all fail with re.DOTALL|re.MULTILINE modes in Python.

Peter Hansen 2009-12-04 14:14:45

Has this been tested? I would have assumed that `^` only has special meaning as the first character of a regexp, and `$` only has special meaning at the end of a regexp, unless the regular expression is a multi-line expression.

PP 2010-02-20 17:35:23

Answer 6

+2 A:

The fastest will be:

r = re.compile(r'a^')
r.match('whatever')

'a' can be any non-special character ('x','y'). Knio's implementation might be a bit more pure but this one will be faster for all strings not starting with whatever character you choose instead of 'a' because it will not match after the first character rather than after the second in those cases.

Adam Nelson 2009-12-04 21:35:51

Indeed, (.^) would be roughly 10% slower than (\x00^) in my case.

Peter Hansen 2009-12-04 22:14:27

I'm accepting this, since using any value other than \n as the character is guaranteed never to match, and I see it as slightly more readable (given that relatively few people are regex experts) than the (?!x)x option, though I voted that one up too. In my case, for either option I would need a comment to explain it, so I think I'll just adjust my original attempt to '\x00NEVERMATCHES^'. I get the no-match guarantee of this answer, with my original self-documenting-ness. Thanks to all for answers!

Peter Hansen 2009-12-06 22:15:49

Does this actually work and if so, who decided to break with Unix? In Unix regexps, `^` is special only as the first character and similarly with `$`. With any Unix tool, that regexp is going to match anything containing the literal string `a^`.

jk 2009-12-19 21:11:31

Heh, that's a good attack. I never tested against that literal string.

Adam Nelson 2010-01-07 17:15:15

Answer 7

+2 A:

Perl 5.10 supports special control words called "verbs", which is enclosed in (*...) sequence. (Compare with (?...) special sequence.) Among them, it includes (*FAIL) verb which returns from the regular expression immediately.

Note that verbs are also implemented in PCRE shortly after, so you can use them in PHP or other languages using PCRE library too. (You cannot in Python or Ruby, however. They use their own engine.)

Kang Seonghoon 2009-12-05 06:53:06

The docs for that at http://perldoc.perl.org/perlre.html#%28%2AFAIL%29-%28%2AF%29 say "This pattern matches nothing and always fails. It is equivalent to (?!), but easier to read. In fact, (?!) gets optimised into (*FAIL) internally." Interesting, as (?!) is my favourite "pure" answer so far (even though it doesn't work in Javascript). Thanks.

Peter Hansen 2009-12-06 02:57:39

Answer 8

A:

I believe that

\Z RE FAILS! \A

covers even the cases where the regular expression includes flags like MULTILINE, DOTALL etc.

>>> import re
>>> x=re.compile(r"\Z RE FAILS! \A")
>>> x.match('')
>>> x.match(' RE FAILS! ')
>>>

I believe (but I haven't benchmarked it) that whatever the length (> 0) of the string between \Z and \A, the time-to-failure should be constant.

ΤΖΩΤΖΙΟΥ 2009-12-10 18:18:38

Answer 9

+1 A:

[^\d\D] or (?=a)b or a$a or a^a

Bart Kiers 2009-12-19 20:16:08

Thanks. Note that (?!x)x was the first answer given, listed above.

Peter Hansen 2009-12-19 20:45:40

Yes, it seemed I scanned the other answerers too quickly.

Bart Kiers 2009-12-19 20:50:30

Answer 10

+3 A:

One that was missed:

^\b$

It can't match because the empty string doesn't contain a word boundary. Tested in Python 2.5.

Mark Byers 2010-02-20 17:29:07

ansaurus

tags:

views:

answers:

What regular expression can never match?

related questions