ansaurus

Question

Answer 1

+3 A:

Instead of:

(\'[^\']*\'|"[^"]*")

Simply write:

\'([^\']*)\'|"([^"]*)"
  \______/    \_____/
     1           2

Now one of the groups will match the quoted content.

In most flavors, when a group that failed to match is referred to in a replacement string, the empty string is substituted, so you can simply replace with $1$2: one will be the successful capture (depending on which alternate matched) and the other will substitute in the empty string.

Here's a PHP implementation (as seen on ideone.com):

$text = <<<EOT
"hello", how 'are "you" today'
EOT;

print preg_replace(
  '/\'([^\']*)\'|"([^"]*)"/',
  '$1$2',
  $text
);
# hello, how are "you" today

A closer look

Let's use 1 and 2 to stand for the two quote characters, and add whitespace to the patterns (both for clarity).

Your second solution used this pattern:

(  1[^1]*1  |  2[^2]*2  )
\_______________________/
   capture whole thing
   content and quotes

As you correctly pointed out, this matches a pair of quotes correctly (assuming that you can't escape quotes), but it doesn't capture the content part on its own.

This may not be a problem depending on context (e.g. you can simply trim one character from the beginning and end to get the content), but at the same time, it's also not that hard to fix the problem: simply capture the content from the two possibilities separately.

1([^1]*)1  |  2([^2]*)2
 \_____/       \_____/
 capture contents from
each alternate separately

Now either group 1 or group 2 will capture the content, depending on which alternate was matched. As a "bonus", you can check which quote was used, i.e. if group 1 succeeded, then 1 was used.
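As a rough illustration of the "bonus" above, here is a minimal Python sketch (Python rather than the PHP used earlier; the helper name quoted_parts is made up for this example). It checks which group participated in the match to recover both the quote character and the content, assuming quotes cannot be escaped:

```python
import re

# Same pattern as above: group 1 = single-quoted content, group 2 = double-quoted content.
QUOTED = re.compile(r"'([^']*)'|\"([^\"]*)\"")

def quoted_parts(text):
    """Return (quote_char, content) pairs for each quoted run in text."""
    parts = []
    for m in QUOTED.finditer(text):
        if m.group(1) is not None:   # first alternate matched
            parts.append(("'", m.group(1)))
        else:                        # second alternate matched
            parts.append(('"', m.group(2)))
    return parts

text = "\"hello\", how 'are \"you\" today'"
print(quoted_parts(text))
```

Testing the group against None rather than for truthiness matters here, because an empty quoted string such as '' captures the empty string, which is falsy in Python.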

Appendix

[…] is a character class. Something like [aeiou] matches any one of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches any one character other than the lowercase vowels.

(…) is used for grouping. (pattern) is a capturing group and creates a backreference. (?:pattern) is non-capturing.
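A small Python sketch of these constructs, using made-up sample strings, may help make the distinction concrete:

```python
import re

# [aeiou] matches one lowercase vowel; [^aeiou] matches one character that is not.
vowels = re.findall(r"[aeiou]", "quote")
others = re.findall(r"[^aeiou]", "quote")
print(vowels, others)

# (ab)+ captures (the last repetition is kept); (?:ab)+ only groups.
capturing = re.match(r"(ab)+", "abab")
grouping_only = re.match(r"(?:ab)+", "abab")
print(capturing.groups(), grouping_only.groups())
```

Both patterns in the second half match the same text; the only difference is whether a numbered group is created.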

References

regular-expressions.info/Brackets for capturing, Alternation, Character class, Repetition

polygenelubricants 2010-08-24 21:58:51

Thanks very much! Works great.

Nick 2010-08-24 22:43:27

Answer 2

A:

You cannot do this with a regular expression. It requires internal state to keep track of (among other things):

Whether or not a previous quote of a certain type has been encountered
Whether or not the "outer" level of quotes is the current level
Whether an "inner" set of quotes has been descended into, and if so, where that set of quotes begins in the string

This requires a grammar-aware parser to do correctly. A regular expression engine does not keep state because it is a finite-state automaton, which operates only on the current input regardless of previous circumstances.

It's the same reason you cannot reliably match sets of nested parentheses or XML elements.
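For comparison, the kind of state tracking described here can be sketched as a small hand-written scanner in Python. This is only an illustration under simple assumptions (no escaped quotes, only the outermost quotes are stripped), and the function name strip_outer_quotes is made up for the example:

```python
def strip_outer_quotes(text):
    """Remove the outermost quote pairs, leaving inner quotes untouched."""
    out = []
    open_quote = None  # quote character of the run currently open, if any
    start = 0          # index where the open quote was seen
    for i, ch in enumerate(text):
        if open_quote is None:
            if ch in "'\"":
                open_quote, start = ch, i    # descend into a quoted run
            else:
                out.append(ch)               # plain text outside quotes
        elif ch == open_quote:
            out.append(text[start + 1:i])    # emit content without the quotes
            open_quote = None
    if open_quote is not None:
        out.append(text[start:])             # unterminated quote: keep as-is
    return "".join(out)

print(strip_outer_quotes("\"hello\", how 'are \"you\" today'"))
```

For this single level of nesting the scanner produces the same result as the regex replacement in Answer 1; the explicit state only becomes necessary if arbitrarily deep nesting or escape rules have to be handled.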

Jesse Dhillon 2010-08-24 22:01:29

Once you add backreferences, lookahead, and the like, regular expressions are significantly more powerful than finite automata. But I agree that a parser is often a much better choice than a regex for these sorts of tasks.

Jim Lewis 2010-08-24 22:11:32

It depends on how you define the lexical structure of string literals. polygenelubricants's definition is a valid one. Keep in mind that regexps in real regexp languages are very different from regular expressions in the formal computer science sense, and do not share the limitations of FSAs. See http://en.wikipedia.org/wiki/Regular_expression#Patterns_for_non-regular_languages

LarsH 2010-08-24 22:12:45

Answer 3

+2 A:

Regarding:

Backreferences cannot be used in character classes, so I can't use something like
(\'|")([^\1\r\n]*)\1

(\'|")(((?!(\1|\r|\n)).)*)\1

(where (?!...) is a negative lookahead for ...) should work.

I don't know whether this solves your main problem, but it does solve the "match a character iff it doesn't match a backref" part.
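As a quick check of the lookahead pattern, here is a minimal sketch in Python, whose re module accepts a backreference inside a lookahead. The sample text is the one from Answer 1; group 1 holds the opening quote and group 2 the content:

```python
import re

# Group 1 = opening quote, group 2 = content.
# (?!(\1|\r|\n)). consumes one character only if it is not the opening quote or a line break.
QUOTED = re.compile(r"""('|")(((?!(\1|\r|\n)).)*)\1""")

text = "\"hello\", how 'are \"you\" today'"
found = [(m.group(1), m.group(2)) for m in QUOTED.finditer(text)]
print(found)
```

Because the lookahead excludes only the quote that opened the match, the other quote character may appear freely inside the content.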

Edit:

Missed a parenthesis, fixed.

David X 2010-08-24 23:10:15

Thanks, useful for future reference.

Nick 2010-08-26 20:15:42


Matching quote contents
