Hi,

being a regex beginner, I need some help writing a regex. It should match a particular pattern, let's say "ABC", but not when the pattern is used in a comment (' being the comment sign). So XYZ ' ABC shouldn't match, and x("teststring ABC") shouldn't match either. But ABC("teststring ' xxx") has to match through to the end, that is, without xxx being cut off. Also, does anybody know a free regex application that you can use to "debug" your regex? I often have trouble recognizing what's wrong with my attempts. Thanks!

+1  A: 

I find the best 'debugger' for regexes is just messing around in an interactive environment, trying lots of small bits out. For Python, ipython is great; for Ruby, irb; for command-line type stuff, sed...

Just try out little pieces at a time, make sure you understand them, then add an extra little bit. Rinse and repeat.
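
As a rough illustration of that workflow in Python, the sketch below tries a few small pieces one at a time against the cases from the question. The sample lines, the `steps` list, and the `try_pattern` helper are assumptions made up for this example, not part of any tool.

```python
import re

# Hypothetical sample lines, taken from the cases in the question.
samples = [
    'ABC("teststring \' xxx")',
    "XYZ ' ABC",
    'x("teststring ABC")',
]

# Small pieces to try in isolation before combining them.
steps = [
    r"ABC",       # the bare pattern
    r"'.*",       # a comment: apostrophe to end of line
    r'"[^"]*"',   # a double-quoted string
]

def try_pattern(pattern, lines):
    """Return, for each line, the list of substrings the pattern matches."""
    return [re.findall(pattern, line) for line in lines]

for step in steps:
    print(step, try_pattern(step, samples))
```

Looking at the output of each piece on its own makes it easier to see which piece misbehaves, for example that the bare comment pattern also fires on the apostrophe inside the string in the first sample.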

Peter
Be careful of the subtle differences between regex flavors here - what works in Python may not necessarily work in JavaScript.
Chris Lutz
absolutely. hence the different suggestions for different languages. a great point to keep in mind, and another reason for specialized debugging.
Peter
Firebug for JavaScript
Justin Johnson
+4  A: 

On the topic of good regex tools, I really like RegexBuddy, but it's not free.

Other than that, a regex is the wrong tool for the job if you need to check inside string delimiters and all sorts too. You need a finite-state machine.
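
A minimal sketch of what such a state machine could look like in Python, assuming VB-style syntax: a double quote toggles string state, and an apostrophe outside a string starts a comment that runs to the end of the line. The function name is hypothetical, and escaped or doubled quotes are deliberately not handled.

```python
def find_outside_strings_and_comments(line, needle):
    """Return start offsets of `needle` in `line` that are in plain code.

    Assumes VB-style syntax: '"' toggles a string, and "'" outside a
    string starts a comment. Escaped quotes are not handled.
    """
    hits = []
    if not needle:
        return hits
    in_string = False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_string:
            if ch == '"':
                in_string = False   # closing quote: back to code
        elif ch == '"':
            in_string = True        # opening quote: enter string
        elif ch == "'":
            break                   # comment: ignore the rest of the line
        elif line.startswith(needle, i):
            hits.append(i)
            i += len(needle)        # skip past the match
            continue
        i += 1
    return hits
```

With the question's examples, `find_outside_strings_and_comments('ABC("teststring \' xxx")', "ABC")` reports one hit at offset 0, while the comment and string cases report none.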

Matthew Scharley
-1 because I hate RegexBuddy, but +2 for "wrong tool." You win this time, RegexBuddy! <cheesily generic angry fist shake that no one really does in real life>
Chris Lutz
Finite state FTW
Justin Johnson
+1 for RegexBuddy (and that a regex is the wrong tool) - it's very nice for testing or breaking down regexes. I have no need for its "magic" features, just the basic ones.
TrueWill
Yup, I often use RegexBuddy to test regular expressions. It is also nice when you don't bother to remember all the various regex operators :p
Svish
thanks, will try that one
noisecoder
+3  A: 

Some will swear by RegexBuddy. I've never used the debugger, but I advise you to steer away from the regex generator it provides. It's just a bad idea.

You may be able to pull this off with whatever regex flavor you're using, but in general I think you're going to find it easier and more maintainable to do this the "hard" way. Regular expressions are for regular languages, and nested anything usually means that regexes aren't a good idea. Modern extensions to regex syntax mean it may be doable, but it's not going to be pretty, and you sure won't remember in the morning how it works. One place where regular expressions fail quite spectacularly (even with modern non-regular extensions) is parsing nested structures: trying to parse any mixture of comments, quoted strings, and parentheses quickly devolves into an incomprehensible and unmaintainable mess. Don't get me wrong - I'm a fan of regular expressions in the right places. This isn't one of them.

Chris Lutz
yeah i noticed that. but i have to stick to the regex way for design issues. so i decided to strip the critical parts away manually code-wise before applying the regex to it.
noisecoder
A: 

Could you clarify? I read it thrice, and I think you want to match a given pattern when it appears as a literal, that is, not as part of a comment or a string.

What you're asking for is pretty tricky to do as a single regexp, because you want to skip strings. Multiple strings in one line would complicate matters further.

I wouldn't even try to do it in one regexp. Instead, I'd pass each line through a filter first, to remove strings, and then comments in that order. And then try and match your pattern.

Here it is in Perl, chosen for its regexp processing power. It assumes @lines is the list of lines you want to check, and $pattern is the pattern you want to match.

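my @matches;
for (@lines) {
  my $line = $_;
  # Strip strings: from a double quote to the first unescaped double quote.
  $line =~ s/".*?(?<!\\)"//g;
  # Strip the comment: from an apostrophe to the end of the line.
  $line =~ s/'.*//g;
  # Keep the original line if the pattern survives the stripping.
  push @matches, $_ if $line =~ m/$pattern/;
}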

The first substitution removes anything that starts with a double quotation mark and ends with the first unescaped double quote, using the standard escape character, a backslash. The next strips comments. If the pattern still matches, the original line is added to the list of matches.

It's not perfect because it can't tell the difference between "a\\" and "a\". The first is usually a valid string; the latter is not. Either way, the substitution will continue to look for another ", and if one isn't found the string isn't stripped. We could use another substitution to replace all double backslashes with something else first, but that will cause problems if the pattern you're looking for contains a backslash.
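
For readers more at home in Python, here is a sketch of the same filter-then-match idea. The names `STRING_RE`, `COMMENT_RE`, and `matching_lines` are made up for this example, and it carries the same backslash-escape assumption and the same "a\\" limitation described above.

```python
import re

# A quoted string: from a double quote to the first quote not preceded by a backslash.
STRING_RE = re.compile(r'".*?(?<!\\)"')
# A comment: from an apostrophe to the end of the line.
COMMENT_RE = re.compile(r"'.*")

def matching_lines(lines, pattern):
    """Return the lines where `pattern` matches outside strings and comments."""
    matches = []
    for line in lines:
        stripped = STRING_RE.sub("", line)       # strings first...
        stripped = COMMENT_RE.sub("", stripped)  # ...then comments
        if re.search(pattern, stripped):
            matches.append(line)                 # keep the original line
    return matches
```

The order matters: stripping strings first means an apostrophe inside a string is gone before the comment pattern gets a chance to see it.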

EmFi
+1  A: 

For .NET development you might as well try RegexDesigner; this tool can generate code (VB/C#) for you. It is a very good tool for us regex starters.

jerjer
A: 

You can use a zero width look-behind assertion if you only have single line comments, but if you're using multi-line comments, it gets a little trickier.

Ultimately, you really need to solve this kind of issue with some sort of parser, given that the definition of a comment is really driven by a grammar.

This answer to a different but related question looks good too...

John Weldon
A: 

If you have Emacs, there is a built-in regex tool called "regexp-builder". I don't really understand the specifics of your regex question well enough to suggest an answer to that.

Kinopiko
+2  A: 

Odd that lots of people recommend their favorite tools, but nobody provides a solution for the problem at hand. (I'm the developer of RegexBuddy, so I'll refrain from recommending any tools.)

There's no good way to match Y except when it's part of XYZ with a single regular expression. What you can do is write a regex that matches both Y and XYZ: Y|XYZ. Then use a bit of extra code to process the matches for Y, and ignore those for XYZ. One way to do that is with a capturing group: (Y)|XYZ. Now you only process the matches where the first capturing group took part. When XYZ matches, the capturing group doesn't match anything.
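
In Python, the "bit of extra code" could look roughly like the sketch below. The function name is an assumption for illustration; the regex is the (Y)|XYZ from the paragraph above.

```python
import re

def starts_of_y_outside_xyz(text):
    """Return start offsets of "Y" matches that are not inside "XYZ"."""
    hits = []
    for m in re.finditer(r"(Y)|XYZ", text):
        # Group 1 is set only when the (Y) branch matched;
        # for an XYZ match it is None and the match is ignored.
        if m.group(1) is not None:
            hits.append(m.start(1))
    return hits
```

For the input "XYZ Y XYZ Y" this returns the offsets of the two standalone Y characters and skips both XYZ occurrences.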

To do this for your VB-style comments, you can use the regex:

'.*|(ABC)

This regex matches a single quote and everything up to the end of the line, or ABC. This regex will match all comments (whether those include ABC or not). The capturing group will match all occurrences of ABC, except those in comments.

If you want your regex to both skip comments and strings, you can add strings to your regex:

'.*|"[^"\r\n]*"|(ABC)
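
A small Python sketch of how this regex could be applied to the cases from the question. The names `SKIP_OR_ABC` and `abc_offsets` are made up for the example; only the regex itself comes from the answer.

```python
import re

# The regex from the answer: comment | string | (ABC).
SKIP_OR_ABC = re.compile(r"""'.*|"[^"\r\n]*"|(ABC)""")

def abc_offsets(line):
    """Return start offsets of ABC found outside comments and strings."""
    return [m.start(1) for m in SKIP_OR_ABC.finditer(line)
            if m.group(1) is not None]
```

In ABC("teststring ' xxx") the string alternative consumes the whole quoted part, apostrophe included, so the comment alternative never sees that apostrophe and ABC at the start is still reported.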
Jan Goyvaerts
The problem is, what about his case where the apostrophe appears inside a set of double quotes? Then he *doesn't* want it to match. I'm sorry, but regex really is the wrong tool for the job here, even if you did manage to write one to do it.
Matthew Scharley
Also, we're giving suggestions on tools because the OP asked for them. Granted, he asked for free tools, but I felt RegexBuddy was worth plugging anyway (it's not expensive anyway, in terms of software).
Matthew Scharley
Whether regex is the right tool for the job depends on what the job is. E.g. if he's doing a one-time edit in a bunch of text files, a quick regex in a text editor can do the job just fine.
Jan Goyvaerts