I am trying to remove quotes from a string. Example:

"hello", how 'are "you" today'

returns

hello, how are "you" today

I am using PHP's preg_replace.

I've got a couple of solutions at the moment:

(\'|")(.*)\1

The problem is that .* is greedy and matches every character in the middle (including quotes), so the result (replacing with $2) is

hello", how 'are "you today'

Backreferences cannot be used in character classes, so I can't use something like

(\'|")([^\1\r\n]*)\1

to not match the first backreference in the middle.
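
The greedy behaviour described above can be reproduced with a short script. This is only a sketch using the sample input from this question; the variable names are illustrative.

```php
<?php
// Sample input from the question.
$text = <<<EOT
"hello", how 'are "you" today'
EOT;

// First attempt: the greedy .* runs on to the LAST quote that equals \1,
// so the inner quotes end up inside group 2.
$greedy = preg_replace('/(\'|")(.*)\1/', '$2', $text);

print $greedy . "\n";
# hello", how 'are "you today'
```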

Second solution:

(\'[^\']*\'|"[^"]*")

The problem is that the capture group includes the quotes, so replacing with $1 changes nothing at all. The result ($1):

"hello", how 'are "you" today'
+3  A: 

Instead of:

(\'[^\']*\'|"[^"]*")

Simply write:

\'([^\']*)\'|"([^"]*)"
  \______/    \_____/
     1           2

Now one of the groups will match the quoted content.

In most flavors, when a group that failed to match is referred to in a replacement string, the empty string is substituted. So you can simply replace with $1$2: one will be the successful capture (depending on which alternate matched) and the other will be replaced by the empty string.

Here's a PHP implementation (as seen on ideone.com):

$text = <<<EOT
"hello", how 'are "you" today'
EOT;

print preg_replace(
  '/\'([^\']*)\'|"([^"]*)"/',
  '$1$2',
  $text
);
# hello, how are "you" today 

A closer look

Let's use 1 and 2 in place of the two quote characters, and add some whitespace (both for clarity).

Before, you have, as your second solution, this pattern:

(  1[^1]*1  |  2[^2]*2  )
\_______________________/
   capture whole thing
   content and quotes

As you correctly pointed out, this matches a pair of quotes correctly (assuming that quotes can't be escaped), but it doesn't capture the content separately from the quotes.

This may not be a problem depending on context (e.g. you can simply trim one character from the beginning and end to get the content), but at the same time, it's also not that hard to fix the problem: simply capture the content from the two possibilities separately.

1([^1]*)1  |  2([^2]*)2
 \_____/       \_____/
 capture contents from
each alternate separately

Now either group 1 or group 2 will capture the content, depending on which alternate was matched. As a "bonus", you can check which quote was used, i.e. if group 1 succeeded, then 1 was used.
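
As a rough sketch of that "bonus", preg_replace_callback can record which kind of quote each match used while stripping it. The callback and the $kinds array are illustrative names, not part of the original solution. The sketch looks at the first character of the whole match, which is equivalent to asking which group participated but avoids depending on how PHP reports unmatched groups.

```php
<?php
$text = <<<EOT
"hello", how 'are "you" today'
EOT;

$pattern = '/\'([^\']*)\'|"([^"]*)"/';
$kinds = [];

$result = preg_replace_callback(
    $pattern,
    function (array $m) use (&$kinds) {
        // The opening character of the whole match tells us which alternate matched.
        $isSingle = ($m[0][0] === "'");
        $kinds[] = $isSingle ? 'single' : 'double';
        // Group 1 holds the content for single quotes, group 2 for double quotes.
        return $isSingle ? $m[1] : $m[2];
    },
    $text
);

print $result . "\n";
# hello, how are "you" today
print implode(', ', $kinds) . "\n";
# double, single
```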


Appendix

[…] is a character class. Something like [aeiou] matches any one of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches any one character other than a lowercase vowel.

(…) is used for grouping. (pattern) is a capturing group and creates a backreference. (?:pattern) is non-capturing.
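
A few one-liners illustrate these constructs with preg_match. The sample strings and variable names are made up for illustration.

```php
<?php
// [aeiou] matches a single lowercase vowel: the first one in 'sky blue' is 'u'.
preg_match('/[aeiou]/', 'sky blue', $vowel);

// [^aeiou] matches a single character that is NOT a lowercase vowel: 'r' in 'ear'.
preg_match('/[^aeiou]/', 'ear', $nonVowel);

// (?:gr) groups without capturing; ([ae]) is capturing group 1.
preg_match('/(?:gr)([ae])y/', 'grey', $groups);

print $vowel[0] . ' ' . $nonVowel[0] . ' ' . $groups[1] . "\n";
# u r e
```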

polygenelubricants
Thanks very much! Works great.
Nick
A: 

You cannot do this with a regular expression. This requires an internal state to keep track of (among other things)

  • Whether or not a previous quote of a certain type has been encountered
  • Whether or not the "outer" level of quotes is the current level
  • Whether an "inner" set of quotes has been descended into, and if so, where that set of quotes begins in the string

This requires a grammar-aware parser to do correctly. A regular expression engine does not keep state because it is a finite-state automaton, which only operates on the current input regardless of previous circumstances.

It's the same reason you cannot reliably match sets of nested parentheses or XML elements.
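
For comparison, here is a minimal sketch of the stateful approach, written as a hand-rolled scanner rather than a full parser. The function name stripOuterQuotes is hypothetical, and the sketch assumes that quotes cannot be escaped and that a quoted run ends at the next identical quote character.

```php
<?php
// Hypothetical helper: removes only the outermost pair of quotes around each quoted run.
function stripOuterQuotes(string $text): string
{
    $out = '';
    $open = null;   // the quote character currently open, or null when outside quotes
    $buffer = '';   // content collected since the opening quote
    $len = strlen($text);

    for ($i = 0; $i < $len; $i++) {
        $ch = $text[$i];
        if ($open === null) {
            if ($ch === '"' || $ch === "'") {
                $open = $ch;          // enter the quoted state
                $buffer = '';
            } else {
                $out .= $ch;          // ordinary character outside quotes
            }
        } elseif ($ch === $open) {
            $out .= $buffer;          // matching close: emit content without the quotes
            $open = null;
        } else {
            $buffer .= $ch;           // inside quotes: the other quote kind is kept as-is
        }
    }

    if ($open !== null) {
        $out .= $open . $buffer;      // unterminated quote: leave it untouched
    }
    return $out;
}

print stripOuterQuotes("\"hello\", how 'are \"you\" today'") . "\n";
# hello, how are "you" today
```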

Jesse Dhillon
Once you add backreferences, lookahead, and the like, regular expressions are significantly more powerful than finite automata. But I agree that a parser is often a much better choice than a regex for these sorts of tasks.
Jim Lewis
It depends how you define the lexical structure of string literals. polygenelubricants's definition is a valid one. Keep in mind that regexps in real regexp languages are very different from Regular Expressions in the formal computer science sense, and do not share the limitations of FSAs. See http://en.wikipedia.org/wiki/Regular_expression#Patterns_for_non-regular_languages
LarsH
+2  A: 

Regarding:

Backreferences cannot be used in character classes, so I can't use something like

(\'|")([^\1\r\n]*)\1

(\'|")(((?!(\1|\r|\n)).)*)\1

(where (?!...) is a negative lookahead for ...) should work.
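
As a quick check, the lookahead pattern can be run through preg_replace on the sample input from the question. This is only a sketch: it replaces each match with $2 (the content between the quotes), and it has been tried on this one input only.

```php
<?php
$text = <<<EOT
"hello", how 'are "you" today'
EOT;

// (?!(\1|\r|\n)). consumes one character only if it is not the opening quote, CR, or LF.
// Group 2 wraps the whole repeated run, i.e. the content between the quotes.
$pattern = '/(\'|")(((?!(\1|\r|\n)).)*)\1/';

$result = preg_replace($pattern, '$2', $text);

print $result . "\n";
# hello, how are "you" today
```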

I don't know whether this solves your main problem, but it does solve the "match a character iff it doesn't match a backref" part.

Edit:

Missed a parenthesis, fixed.

David X
Thanks, useful for future reference.
Nick