Hello

I'm trying to match C-style comments from a file, but only if the comment doesn't start with a certain label introduced by @.

For example, from:

/* some comment to match */
/* another comment.
this should match also */
/*@special shouldn't match*/

Is this possible using regular expressions only?

I'm trying this using JavaScript implementation of regular expressions.

+1  A: 

You could start with something like this:

/\*[^@]

But in general, you don't want to match C-style comments with regular expressions, because of nasty corner cases. Consider:

"foo\" /* " " */ "

There's no comment in that code (it's a compile-time concatenation of two string literals), but you're not going to have much luck parsing it without a real parser. (Technically, you could use a regular expression, because you only need a simple finite state machine. But it's a very disgusting regular expression.)
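
As a rough illustration of the state-machine idea, here is a minimal sketch in JavaScript. The name `findBlockComments` is made up for this example, and the scanner deliberately ignores `//` line comments, trigraphs and line continuations, so treat it as a sketch under those assumptions rather than a complete C lexer.

```javascript
// Hypothetical helper: collect /* ... */ comments while skipping string and
// character literals, so a "/*" inside a literal is not taken for a comment.
function findBlockComments(source) {
  const comments = [];
  let i = 0;
  while (i < source.length) {
    const ch = source[i];
    if (ch === '"' || ch === "'") {
      // Literal state: advance to the matching quote, honoring backslash escapes.
      const quote = ch;
      i++;
      while (i < source.length && source[i] !== quote) {
        if (source[i] === "\\") i++; // skip the escaped character
        i++;
      }
      i++; // step past the closing quote
    } else if (ch === "/" && source[i + 1] === "*") {
      // Comment state: everything up to the next "*/" belongs to the comment.
      const end = source.indexOf("*/", i + 2);
      if (end === -1) break; // unterminated comment: stop scanning
      comments.push(source.slice(i, end + 2));
      i = end + 2;
    } else {
      i++;
    }
  }
  return comments;
}

// The tricky input from above, followed by one plain and one @-labelled comment.
const code = String.raw`"foo\" /* " " */ " /* real comment */ /*@special skip*/`;
const wanted = findBlockComments(code).filter((c) => !c.startsWith("/*@"));
console.log(wanted); // only "/* real comment */" is left
```

Filtering out the `@` labels after the scan keeps the scanner itself simple; the two string literals at the start of `code` are skipped without producing a false comment.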

emk
+1 for pointing out the risky part. I don't think you could use a regular expression to successfully parse a C like language, though. Not even with an extremely ugly one.
Tomalak
Even though you can't parse arbitrary C code with a regex, you can actually strip comments. I've written a state machine to do this before, and any such state machine can be translated into a regex. But I don't think I could construct it by hand without a lot of skull sweat.
emk
+2  A: 
/\*\s*(?!@)(?:(?!\*/).)*\*/

Breaks down as:

/\*               // "/*"
\s*               // optional space
(?!@)             // not followed by "@"
(?:               // don't capture...
   (?!\*/).       // ...anything that is not "*/"
)*                // but match it as often as possible
\*/               // "*/"

Use in "global" and "dotall" mode (i.e. the dot should match newlines as well).
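
For what it's worth, here is how that might look in JavaScript, run against the sample from the question. This is only a sketch: the variable names are made up, and the `s` (dotAll) flag assumes an ES2018-capable engine; on older engines you would have to replace `.` with `[\s\S]`. The second pattern, `strict`, is a variant that is not part of the expression above; it is included only to show what happens when whitespace precedes the `@`.

```javascript
// The question's sample input as one multi-line string.
const input = [
  "/* some comment to match */",
  "/* another comment.",
  "this should match also */",
  "/*@special shouldn't match*/",
].join("\n");

// "g" = global, "s" = dotAll. Slashes are escaped because this is a regex literal.
const pattern = /\/\*\s*(?!@)(?:(?!\*\/).)*\*\//gs;
const matches = input.match(pattern);
console.log(matches); // the first two comments; the @special one is skipped

// Caveat: \s* may backtrack to zero characters, so with whitespace before
// the "@" the lookahead ends up inspecting the space and the comment matches.
console.log("/* @label */".match(pattern)); // matches the whole comment

// Variant (an assumption, not the original expression): put \s* inside the
// lookahead so that optional whitespace followed by "@" is rejected as a unit.
const strict = /\/\*(?!\s*@)(?:(?!\*\/).)*\*\//gs;
console.log("/* @label */".match(strict)); // null
```

If labels always follow the `/*` immediately, as in the question, the original expression is enough.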

The usual word of warning: As with all parsing jobs that are executed with regular expressions, this will fail on nested patterns and broken input.

emk points out a nice example of (otherwise valid) input that will cause this expression to break. This can't be helped, regex is not for parsing. If you are positive that things like this can never occur in your input, a regex might still work for you.

Tomalak
Just to be pedantic, \s*(?!@).? doesn't mean what you think it means; (?!@) is a zero-width negative lookahead. It means that once you have matched as much whitespace as possible (\s*), the match continues ONLY IF the next character is NOT an @. The .? is unnecessary.
Ant
Just to be pedantic, how do you suppose I could have written a negative look-ahead without knowing what it is? ;-) You are right about the ".?" being unnecessary, though. I removed it.
Tomalak
A: 

Use a negative lookahead.

M. Utku ALTINKAYA