ansaurus

Question

Regular expression match only if subpattern doesn't match

Answer 1

+1 A:

You could start with something like this:

/\*[^@]

But in general, you don't want to match C-style comments with regular expressions, because of nasty corner cases. Consider:

"foo\" /* " " */ "

There's no comment in that code (it's a compile-time concatenation of two string literals), but you're not going to have much luck parsing it without a real parser. (Technically, you could use a regular expression, because you only need a simple finite state machine. But it's a very disgusting regular expression.)

emk 2009-03-25 12:40:31
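
To make the corner case above concrete, here is a minimal sketch that runs the starting pattern from this answer against the example string. Python's `re` module is only an illustration choice (the question does not name a language), and the variable names are made up for the example.

```python
import re

# The C source text from the example: two adjacent string literals, no comment.
c_source = r'"foo\" /* " " */ "'

# The naive starting pattern: "/*" followed by any character other than "@".
naive_start = re.compile(r'/\*[^@]')

match = naive_start.search(c_source)

# The pattern reports a comment start even though the "/*" sits inside
# a string literal, because a plain regex has no notion of "inside a string".
print(match.start(), repr(match.group(0)))
```

The regex finds a "comment" at the `/*` inside the first literal, which is exactly the false positive the answer warns about.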

+1 for pointing out the risky part. I don't think you could use a regular expression to successfully parse a C-like language, though. Not even with an extremely ugly one.

Tomalak 2009-03-25 13:00:58

Even though you can't parse arbitrary C code with a regex, you can strip comments. I've written a state machine to do this before, and any such state machine can be translated into a regex. But I don't think I could construct it by hand without a lot of skull sweat.

emk 2009-03-25 14:37:37
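
For reference, the kind of state machine described in the comment above might look roughly like the sketch below. This is not emk's code. The function name `strip_c_comments` and the four state names are hypothetical, and the sketch only handles `/* ... */` comments, string literals, and character literals. It ignores `//` comments, trigraphs, and backslash line continuations.

```python
def strip_c_comments(source: str) -> str:
    """Remove /* ... */ comments, leaving string and char literals untouched."""
    out = []
    state = "code"  # one of: "code", "string", "char", "comment"
    i = 0
    n = len(source)
    while i < n:
        ch = source[i]
        nxt = source[i + 1] if i + 1 < n else ""
        if state == "code":
            if ch == "/" and nxt == "*":
                state = "comment"  # swallow the opening "/*"
                i += 2
                continue
            if ch == '"':
                state = "string"
            elif ch == "'":
                state = "char"
            out.append(ch)
        elif state in ("string", "char"):
            out.append(ch)
            if ch == "\\" and nxt:
                out.append(nxt)  # copy the escaped character verbatim
                i += 2
                continue
            closing = '"' if state == "string" else "'"
            if ch == closing:
                state = "code"
        else:  # inside a comment
            if ch == "*" and nxt == "/":
                state = "code"
                out.append(" ")  # C treats a comment as a single space
                i += 2
                continue
        i += 1
    return "".join(out)


# The tricky input from the answer passes through unchanged.
print(strip_c_comments(r'"foo\" /* " " */ "'))
```

The point of the explicit states is that `/*` is only treated as a comment opener while in the "code" state, which is the piece of context a simple pattern lacks.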

Answer 2

+2 A:

/\*\s*(?!@)(?:(?!\*/).)*\*/

Breaks down as:

/\*               // "/*"
\s*               // optional space
(?!@)             // not followed by "@"
(?:               // don't capture...
   (?!\*/).       // ...anything that is not "*/"
)*                // but match it as often as possible
\*/               // "*/"

Use in "global" and "dotall" mode (i.e. the dot should match newlines as well).

The usual word of warning: As with all parsing jobs that are executed with regular expressions, this will fail on nested patterns and broken input.

emk points out a nice example of (otherwise valid) input that will cause this expression to break. This can't be helped, regex is not for parsing. If you are positive that things like this can never occur in your input, a regex might still work for you.

Tomalak 2009-03-25 12:46:40
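
As an illustration of the "global" and "dotall" advice, here is a small sketch in Python (an assumed language, since the question does not specify one). `re.DOTALL` provides dotall mode and `findall` plays the role of a global search. One caveat worth checking in your own engine: because `\s*` is allowed to backtrack to zero characters, the pattern as written still matches a comment where whitespace precedes the `@`. The `STRICT` variant, which moves the whitespace inside the lookahead, is my own adjustment and not part of the answer.

```python
import re

# The pattern from the answer, verbatim; DOTALL lets "." match newlines too.
COMMENT = re.compile(r"/\*\s*(?!@)(?:(?!\*/).)*\*/", re.DOTALL)

# Hypothetical variant (not from the answer): the lookahead also covers
# the optional whitespace, so backtracking cannot sneak past the "@" check.
STRICT = re.compile(r"/\*(?!\s*@)(?:(?!\*/).)*\*/", re.DOTALL)

source = "/*@keep*/ x /* plain\ncomment */ y /* @spaced */"

# findall returns every non-overlapping match, i.e. a "global" search.
original_matches = COMMENT.findall(source)
strict_matches = STRICT.findall(source)

print(original_matches)  # includes "/* @spaced */" via backtracking
print(strict_matches)    # only the plain multi-line comment
```

If comments like `/* @tag */` never occur in your input, the two patterns behave the same.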

Just to be pedantic, \s*(?!@).? doesn't mean what you think it means, but is rather a 0 width negative lookahead. It means that once you have matched as much whitespace as possible (\s*) continue with the match ONLY IF the next character is NOT an @. The .? is unnecessary.

Ant 2009-03-25 13:07:48

Just to be pedantic, how do you suppose I could have written a negative look-ahead without knowing what it is? ;-) You are right about the ".?" being unnecessary, though. I removed it.

Tomalak 2009-03-25 13:17:02

Answer 3

A:

use negative lookahead

M. Utku ALTINKAYA 2009-03-27 04:10:34
