Hi, community.

Matching a quoted string that allows escaping is not that difficult; see http://ad.hominem.org/log/2005/05/quoted_strings.php. For simplicity I chose the approach in which a string is built from two kinds of "atoms": either a character that is not a quote or a backslash, or a backslash followed by any character.

"(([^"\\]|\\.)*)"

The obvious improvement is to allow different quote characters and use a backreference:

(["'])((\\.|[^\1\\])*?)\1

Multiple backslashes are also interpreted correctly.
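To see the two-atom approach handle escaped quotes and doubled backslashes, here is a minimal sketch using the first, double-quote-only pattern. It uses Python's re module, which is an assumption on my part (the post only mentions RegExr and PHP/PCRE), and the sample strings are made up for illustration.

```python
import re

# First pattern from the post: the body is a run of "atoms", each either a
# character that is not a quote or backslash, or a backslash plus any character.
quoted = re.compile(r'"(([^"\\]|\\.)*)"')

# Hypothetical sample inputs (not from the original post).
escaped_quotes = r'say "he said \"hi\"" now'
double_backslash = r'"path\\" rest'

m1 = quoted.search(escaped_quotes)
m2 = quoted.search(double_backslash)

print(m1.group(1))  # string body, escaped quotes included
print(m2.group(0))  # the \\ pair is one atom, so the next quote closes the string
```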

Now to the part where it gets weird: I have to parse some variables like these (note the missing backslash in the first variable's value):

test = 'foo'bar'
var = 'lol'
int = 7

So I wrote quite a long expression, and found that the following part of it does not work as expected (the only difference from the expression above is the appended "([\r\n]+)"):

(["'])((\\.|[^\1\\])*?)\1([\r\n]+)

Despite the missing backslash, 'foo'bar' is matched. I used gskinner's online tool RegExr for this, but PHP (PCRE) behaves the same way.

You can fix this by hardcoding the quote, that is, by replacing the backreferences with '. Then it works as expected. Does this mean the backreference does not actually work in this case? And what does this have to do with the line break characters, given that it worked without them?

+2  A: 

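You can't use a backreference inside a character class; there, \1 is interpreted as the character with octal code 1 (at least in some regex engines; I don't know whether this is universally true). So [^\1\\] also matches a quote. Without the trailing ([\r\n]+) the lazy quantifier stops at the first quote anyway, which hides the problem.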
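The effect can be reproduced with a short sketch. It assumes Python's re module rather than PCRE (Python also treats numeric escapes inside a character class as characters), and it uses only the value line from the question as input.

```python
import re

sample = "test = 'foo'bar'\n"  # line from the question, backslash missing

# [^\1\\] excludes only U+0001 and the backslash, so a quote is allowed.
quote_in_class = re.fullmatch(r"[^\1\\]", "'") is not None

without_newline = re.compile(r"""(["'])((\\.|[^\1\\])*?)\1""")
with_newline = re.compile(r"""(["'])((\\.|[^\1\\])*?)\1([\r\n]+)""")

# The lazy *? stops at the first closing quote, which hides the problem.
short = without_newline.search(sample).group(0)
# The trailing [\r\n]+ forces the engine to backtrack across the inner quote.
long = with_newline.search(sample).group(0)

print(quote_in_class, repr(short), repr(long))
```

This is why the question's pattern seemed to work until the line break group was appended: the character class never excluded the quote, and only the lazy quantifier kept the match short.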

So instead try the following:

(["'])(?:\\.|(?!\1).)*\1(?:[\r\n]+)

or, as a verbose regex:

(["'])       # match a quote
(?:          # either match...
 \\.         # an escaped character
 |           # or
 (?!\1).     # any character except the previously matched quote
)*           # any number of times
\1           # then match the previously matched quote again
(?:[\r\n]+)  # plus one or more linebreak characters.
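As a rough check, the verbose pattern can be compiled as-is with Python's re.VERBOSE flag (Python is an assumption here; the question targets PCRE). The sample values are hypothetical, and fullmatch is used because the question's full expression, which would anchor the start of the string, is not shown.

```python
import re

string_then_newline = re.compile(
    r"""
    (["'])       # match a quote
    (?:          # either match...
      \\.        # an escaped character
      |          # or
      (?!\1).    # any character except the previously matched quote
    )*           # any number of times
    \1           # then match the previously matched quote again
    (?:[\r\n]+)  # plus one or more linebreak characters
    """,
    re.VERBOSE,
)

# Hypothetical values, each followed by a line break.
samples = [
    "'lol'\n",               # plain string
    r"'foo\'bar'" + "\n",    # properly escaped inner quote
    "\"it's\"\n",            # other quote type inside the string
    "'foo'bar'\n",           # unescaped inner quote from the question
]
results = [bool(string_then_newline.fullmatch(s)) for s in samples]
print(results)
```

Only the last sample is rejected. Note that an unanchored search over the whole line test = 'foo'bar' would still find 'bar' plus the line break, so the pattern relies on whatever precedes it in the full expression to pin down where the string starts.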

Edit: Removed some unnecessary parentheses and changed some into non-capturing parentheses.

Your regex insists on finding at least one line break character after the matched string. Why? What if it's the last line of your file, or if there is a comment or whitespace after the string? You should probably drop that part completely.
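A small sketch of those two cases, again assuming Python's re module and made-up input lines: the version with the mandatory line break finds nothing, while the version without it finds the string.

```python
import re

with_newline = re.compile(r"""(["'])(?:\\.|(?!\1).)*\1(?:[\r\n]+)""")
without_newline = re.compile(r"""(["'])(?:\\.|(?!\1).)*\1""")

# Hypothetical inputs illustrating the two cases mentioned above.
last_line = "var = 'lol'"               # end of file, no trailing line break
with_comment = "var = 'lol' # note\n"   # something follows the string

strict = [with_newline.search(s) for s in (last_line, with_comment)]
relaxed = [without_newline.search(s).group(0) for s in (last_line, with_comment)]

print(strict)   # no match in either case
print(relaxed)  # the quoted string is found in both
```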

Also note that you don't have to make the * lazy for this to work, since the regex can't cross an unescaped quote character. Nor do you have to check for backslashes in the second part of the alternation (?:\\.|(?!\1).), since every backslash has already been consumed by the first part. That's why the escaped-character part has to come first.
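Both points can be checked with a short sketch (Python's re is assumed, and the sample string is hypothetical): greedy and lazy quantifiers give the same match, while swapping the order of the alternation makes the match end at the escaped quote.

```python
import re

sample = r"'a\'b' 'c'"  # first string contains an escaped quote

greedy = re.compile(r"""(["'])(?:\\.|(?!\1).)*\1""")
lazy = re.compile(r"""(["'])(?:\\.|(?!\1).)*?\1""")
# Alternation reversed: the backslash is consumed on its own by (?!\1).
swapped = re.compile(r"""(["'])(?:(?!\1).|\\.)*\1""")

greedy_match = greedy.match(sample).group(0)
lazy_match = lazy.match(sample).group(0)
swapped_match = swapped.match(sample).group(0)

print(greedy_match, lazy_match, swapped_match)
```

With the alternation swapped, the backslash no longer protects the quote that follows it, so the match stops one atom too early.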

Tim Pietzcker