ansaurus

Question

regex that excludes matches within quotes

Answer 1

+1 A:

Assuming the quotes are always paired on a given line, you could match an even number of quotes both before and after the target, and anchor the pattern so that the whole line has to match:

^([^"]*("[^"]*")*[^"]*)*\b(?<!\.)Units(?![_\w(.])\b([^"]*("[^"]*")*[^"]*)*$

This works because the fragment

([^"]*("[^"]*")*[^"]*)*

will only match paired quotes. Adding the start-of-line and end-of-line anchors forces the number of quotes on the left and on the right of your regex to be even.

This won't handle embedded escaped quotes properly, and multiline quoted strings will be trouble.
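A minimal Python sketch of the paired-quote idea, in case it helps to try it outside the IDE. The names `BALANCED`, `LINE` and `has_unquoted_units` are made up for illustration. The balanced fragment is written without the nested repetition used above, because that form can backtrack heavily on lines that do not match; the simplification is an assumption on my part, not part of the original pattern.

```python
import re

# Text with an even number of double quotes: zero or more "..." pairs,
# each preceded by quote-free text, then trailing quote-free text.
BALANCED = r'(?:[^"]*"[^"]*")*[^"]*'

# The target keeps the lookarounds from the regex above.
LINE = re.compile(rf'^{BALANCED}\b(?<!\.)Units(?![_\w(.])\b{BALANCED}$')


def has_unquoted_units(line):
    """True if the line has a bare `Units` with an even quote count on both sides."""
    return LINE.match(line) is not None


print(has_unquoted_units('total = Units + 1'))   # True
print(has_unquoted_units('label = "Units"'))     # False
# Known limitation: the escaped quote is counted as a real one.
print(has_unquoted_units(r'"\"" + Units'))       # False
```

Because the pattern consumes the whole line, it reports at most one hit per line, and the last call shows the escaped-quote caveat mentioned above.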

Michael Donohue 2009-08-04 18:56:28

OK, some things are missing from this expression: first of all it only matches one character between quotes. You should change the middle `[^"]` to `[^"]*`. And it only matches one pair of quotes before and after `Units`, so the whole paired expression should be wrapped in a group and a Kleene closure: `([^"]*("[^"]*")[^"]*)*`. But even then it doesn't match pathological cases like the one I put in my comment on the question: `"\"" + Number + " (" + Units ")\""`. My point is regular expressions aren't the answer. Just look how complicated this expression is. Software is supposed to be simple.

Welbog 2009-08-04 19:10:13

Updated to reflect the corrections suggested. The pathological cases were already enumerated in my answer, I'm not trying to hide or ignore them, but sometimes the workload doesn't have pathological cases. Building a tokenizer is significantly more work than adding a couple dozen characters to a regex. Software is supposed to be simple.

Michael Donohue 2009-08-04 19:24:56

Good comeback. I don't have full knowledge of every IDE out there, but I'm reasonably sure they already come with parsers (for syntax highlighting) that have already tokenized the code and can easily search for named tokens rather than treating the code as a string. While the code to tokenize code is certainly more complicated than most stand-alone regular expressions, it's conceptually simpler to scan tokenized code than it is to scan code as a set of strings. When I say "software is supposed to be simple", I am referring to the ability to easily understand what is going on at a high level.
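To make the tokenizing idea concrete, here is a rough Python sketch of scanning tokens instead of raw text. It is only an illustration: `TOKEN` and `find_identifier` are hypothetical names, the token rules cover double-quoted strings with backslash escapes and plain identifiers, and comments and character literals are not handled at all.

```python
import re

# One alternative per token kind; the first one that matches wins.
TOKEN = re.compile(r'''
    (?P<string>"(?:[^"\\\n]|\\.)*")   # double-quoted literal with backslash escapes
  | (?P<ident>[A-Za-z_]\w*)           # identifier or keyword
  | (?P<other>.)                      # any other single character
''', re.VERBOSE | re.DOTALL)


def find_identifier(source, name):
    """Return the offsets of identifier tokens that are exactly `name`."""
    return [m.start() for m in TOKEN.finditer(source)
            if m.lastgroup == 'ident' and m.group() == name]


line = r'"\"" + Number + " (" + Units ")\""'
for offset in find_identifier(line, 'Units'):
    print(offset, line[offset:offset + len('Units')])
```

Once the line is split into tokens, "is this `Units` inside a string?" stops being a counting problem: a `Units` inside a literal never shows up as an identifier token.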

Welbog 2009-08-04 19:29:31

Also +1 for this expression. I've tested it and it works as you say it should. This particular answer is fine, and the approach is correct. It's just that regular expressions aren't suited for the general case.

Welbog 2009-08-04 19:31:21

Answer 2

+1 A:

IntelliJ uses Java regexes, doesn't it? Try this:

(?m)(?<![\w.])Units(?![\w(.])(?=(?:[^\r\n"\\]++|\\.)*+[^\r\n"\\]*+$)

The first part is your regex after a little cosmetic surgery:

(?<![\w.])Units(?![\w(.])

The \b anchors at the beginning and end were effectively the same as a negative lookbehind and a negative lookahead (respectively) for \w, so I folded them into your existing lookarounds. The new lookahead matches the rest of the line if it contains an even number (including zero) of unescaped quotation marks:

(?=(?:[^\r\n"\\]++|\\.)*+[^\r\n"\\]*+$)

That handles pathological cases like the one Welbog pointed out, and unlike Michael's regex it will find multiple occurrences of the text on the same line. But it doesn't take comments into account. Is IntelliJ's find/replace feature intelligent enough to disregard text in comments? Come to think of it, doesn't it have some kind of refactoring support built in?
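For anyone who wants to experiment, here is a hedged Python sketch of the even-quote lookahead. Two things differ from the Java pattern. Python's `re` module only gained possessive quantifiers in 3.11, so plain greedy ones are used. Also, as printed, the lookahead has no branch that consumes an unescaped quote, so it can only succeed when no quote follows on the line; the explicit quoted-pair group below is my assumption about the stated intent (an even number of unescaped quotes), not necessarily the exact original. `UNQUOTED`, `PAIR`, `UNITS` and `find_units` are illustrative names.

```python
import re

# A run with no unescaped double quote: ordinary characters,
# or a backslash followed by any character.
UNQUOTED = r'(?:[^\r\n"\\]|\\.)*'

# One complete quoted string, together with the text leading up to it.
PAIR = rf'{UNQUOTED}"{UNQUOTED}"'

UNITS = re.compile(
    # Target with the folded lookarounds, then a lookahead that requires the
    # rest of the line to be whole "..." pairs followed by quote-free text.
    rf'(?<![\w.])Units(?![\w(.])(?=(?:{PAIR})*{UNQUOTED}$)',
    re.MULTILINE,
)


def find_units(text):
    """Return the offsets of `Units` occurrences that are outside string literals."""
    return [m.start() for m in UNITS.finditer(text)]


print(find_units('Units + Units + "Units"'))               # [0, 8]
print(find_units(r'"\"" + Number + " (" + Units ")\""'))   # [23]
```

Only the text after each candidate is inspected, which is why several hits on one line can be reported. The sample lines use LF line endings.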

Alan Moore 2009-08-04 19:57:56

+1 because I admire your perseverance in solving the wrong problem with the wrong solution and ending up with an answer that works.

Welbog 2009-08-04 20:05:09

:) Some people wrestle alligators, I wrestle regexes.

Alan Moore 2009-08-04 20:31:04
