views:

101

answers:

4

Please explain why the expression makes sense if it is complicated.

+1  A: 

If you are actually using grep, you could use the -v option to select only the lines that don't match:

grep -v \(cat\|dog\|fish\|^$\)

The pattern will select empty lines and lines containing "cat", "dog" and "fish".

Okay, you're not using grep. According to http://www.regular-expressions.info/refadv.html , if your regex engine supports it, you want ?!:

`(?!regex)` Zero-width negative lookahead. Identical to positive lookahead, except that the overall match will only succeed if the regex inside the lookahead fails to match. `t(?!s)` matches the first `t` in `streets`.
Jack Kelly
Thanks Jack. I was aware of that, but I am looking for a regular expression independent of grep.
S13
Then why is your question tagged grep?
Stephen
A: 

Let's explore how we can build up a pattern which excludes specific phrases.

We'll start with a simple .*, which matches any character (using the dot), zero or more times (star). This pattern will match any string, including an empty string1.

However, since there are specific phrases we don't want to match, we can try to use a negative lookaround to stop it from matching what we don't want. A lookaround is a zero-width assertion, which means that the regex engine needs to satisfy the assertion for there to be a match, but the assertion does not consume any characters (or in other words, it doesn't advance the position in the string). In this specific case, we will use a lookahead, which tells the regex engine to look ahead of the current position to match the assertion (there are also lookbehinds, which, naturally, look behind the current position). So we'll try (?!cat|dog|fish).*.

When we try this pattern against catdogfish, though, it matches atdogfish! What's going on here? Let's take a look at what happens when the engine tries to use our pattern on catdogfish.

The engine works from left to right, starting from before the first character in our string. On it's first attempt, the lookahead asserts that the next characters from that point are not cat, dog, or fish, but since they actually are cat, the engine cannot match from this point, and advances to before the second character. Here the assertion succeeds, because the next characters following do not satisfy the assertion (atf does not match cat or dog and atfi does not match fish). Now that the assertion succeeds, the engine can match .*, and since by default regular expressions are greedy (which means that they will capture as much of your string as possible), the dot-star will consume the rest of the string.

You might be wondering why the lookaround isn't checked again after the first assertion succeeds. That is because the dot-star is taken as one single token, with the lookaround working on it as a whole. Let's change that so that the lookaround asserts once per repetition: (?:(?!cat|dog|fish).)*.

The (?:…) is called a non-capturing group. In general, things in regular expressions are grouped by parentheses, but these parentheses are capturing, which means that the contents are saved into a backreference (or submatch). Since we don't need a submatch here, we can use a non-capturing group, which works the same as a normal group, but without the overhead of keeping track of a backreference.

When we run our new pattern against catdogfish, we now get three matches2: at, og and ish! Let's take a look at what's going on this time inside the regex engine.

Again the engine starts before the first character. It enters the group that will be repeated ((?!cat|dog|fish).) and sees that the assertion fails, so moves onto the next position (a). The assertion succeeds, and the engine moves forwards to t. Again the assertion succeeds, and the engine moves forwards again. At this point, the assertion fails (because the next three characters are dog), and the engine returns at as a match, because that is the biggest string (so far, and the engine works from left to right), that matches the pattern.

Next, even though we've already got a match, the engine will continue. It will move forwards to the next character (o), and again pick up two characters that match the pattern (og). Finally, the same thing will happen for the ish at the end of the string. Once the engine hits the end of the string, there is nothing more for it to do, and it returns the three matches it picked up.

So this pattern still isn't perfect, because it will match parts of a string that contain our disallowed phrases. In order to prevent this, we need to introduce anchors into our pattern: ^(?:(?!cat|dog|fish).)*$

Anchors are also zero-width assertions, that assert that the position the engine is in must be a specific location in the string. In our case, ^ matches the beginning of the string, and $ matches the end of the string. Now when we match our pattern against catdogfish, none of those small matches can be picked up anymore, because none of them match the anchor positions.

So the final expression would be ^(?:(?!cat|dog|fish).)*$.


1 However, the dot doesn't match newline characters by default, unless the /s (or "single line") modifier is enabled on the regex.
2 I'm making the assumption here that the pattern is working in "global" mode, which makes the pattern match as many times as possible. Without global mode, the pattern would only return the first match, at.

Daniel Vandersluis
A: 

Thank you for taking the time to write that response Daniel. I actually asked that question, but I just created an account so now my uname is different. I am picking a book up on the topic today at the library, but I still can't get this to work. With your expression and a file called words containing several words including cat, dog, fish, shouldn't this command work?

grep ^(?:(?!cat|dog|fish).)*$ words

I keep getting bash: !from: event not found

Hopefully this is my misunderstanding. Thanks for everyone's patience.

You need to place the regexp in single quotes if you're running grep from bash. Also, I am not sure if grep understands advanced RE like that
Arkadiy
A: 

Normally it's better to leave the negation to the code "around" regexp - such as -v switch in grep or !~ in perl. Is there a particular problem you're trying to solve, or its it just an exercise?

Arkadiy
How can I capture all nonempty sequences of letters other than cat, dog, fish using a regular expression? I know I could do it with grep -v, but is it possible to create a regular expression that represents that? I thought Daniel had it above, but I couldn't get it to work.
S13
"Doctor, it hurts when I do this!" "Well, then don't". On a more serious note - what is the problem with Daniel's solution? It should work for one-per-line words...
Arkadiy