tags:

views:

154

answers:

3

One of my homework questions asked to develop a regex for all strings over x,y,z that did not contain xxx

After doing some reading I found out about negative lookahead and made this which works great:

(x(?!xx)|y|z)*

Still, in the spirit of completeness, is there anyway to write this without negative lookahead?

Reading I have done makes me think it can be done with some combination of carets (^), but I cannot get the right combination so I am not sure.

Taking it a step further, is it possible to exclude a string like xxx using only the or (|) operator, but still check the strings in a recursive fashion?

EDIT 9/6/2010:

Think I answered my own question. I messed with this some more, trying make this regex with only or (|) statements and I am pretty sure I figured it out... and it isn't nearly as messy as I thought it would be. If someone else has time to verify this with a human eye I would appreciate it.

(xxy|xxz|xy|xz|y|z)*(xxy|xxz|xx|xy|xz|x|y|z)

+5  A: 

Try this:

^(x{0,2}(y|z|$))*$

The basic idea is this: for match at most 2 X's, followed by another letter or the end of the string.

When you reach a point where you have 3 X's, the regex has no rule that allows it to keep matching, and it fails.

Working example: http://rubular.com/r/ePH0fHlZxL

A less compact way to write the same is (with free spaces, usually the /x flag):

^(
y|         # y is ok
z|         # so is z
x(y|z|$)|  # a single x, not followed by x
xx(y|z|$)  # 2 x's, not followed by x
)*$

Based on the latest edit, here's an ever flatter version of the pattern: I'm not entirely sure I understand your fascination with the pipe, but you can eliminate some more options - by allowing an empty match on the second group you don't need to repeat permutations from the first group. That regex also allows ε, which I think is included in your language.

^(xxy|xxz|xy|xz|y|z)*(xx|x|)$
Kobi
Yeah this works! Kobi you seem to be the regex man, do you think this can be done with only `or` `(|)` operators? I can get close, but I cannot account for a single `x` :)
typoknig
@typoknig - I posted a more literal way to match your language. There are many more between them, like `^(x|y|xx?(y|z|$))*$`
Kobi
+2  A: 

I know you don't want to use lookahead, but here's another way to solve this:

^(?:(?!xxx)[xyz])*$

will match any line of characters x, y or z as long as it doesn't contain the string xxx.

Tim Pietzcker
Excellent, there are plenty of options for sure. I came across a question here at SO once (but I cannot find it now) where someone was asking a similar question to mine because negative lookahead would not work on their compiler or something like that. Does that sound right?
typoknig
Usually it's *lookbehind* that's missing (most notable examples being JavaScript and Ruby until 1.8); most regex flavors have lookahead nowadays except POSIX/GNU BRE/ERE engines.
Tim Pietzcker
+2  A: 

Basically you have the right answer already - well done you. :)

Carat (^) in a set [^abc] will only match where it does not find a character in that set so it's application for matching orders of characters (i.e. strings) is limited and weak.

Regex has numeric quantifiers {n} and {a,b} which allow you to match a defined number of repititions of a pattern, which would work for this specific pattern (because it's 'x' repeated) but it's not particularily expressive of the problem you're trying to solve (even for regex!) and is a bit brittle (it wouldn't be appropriate for negative match 'xyx' for example.

An or pattern again would be verbose and rather unexpressive but it could be done as the fragment:

(x|xx)[^x] // x OR xx followed by NOT x

Obviously you can do this with an iterative algorithm but that's highly inefficient compared to a regex.

Well done for thinking beyond the solution though.

annakata
`(x|xx)[^x]` will fail at the end of the string, though.
Tim Pietzcker
I think your example is close to what I was thinking... I'm gonna play with it. So lets say you had a really really long regex, extreme verbosity, would it be possible to exclude a string based on only `or` `(|)` operators or is it an impossibility?
typoknig
@Tim Pietzcker - sure it's just a fragment, it changes with context @typoknig - ad absurdum any pattern is expressable as a series of ORs, you just OR every absolute possible match together, it's just that at that point it's no longer a pattern you're representing. But hypothetically yes it's always possible.
annakata