I'm writing a method to lift certain data out of an SQL query string, and I need to regex match any word inside of curly braces ONLY when it appears outside of single-quotes. I also need it to factor in the possibility of escaped (preceded by backslash) quotes, as well as escaped backslashes.

In the following examples, I need the regex to match {FOO} and not {BAR}:

blah blah {FOO} blah 'I\'m typing {BAR} here with an escaped backslash \\'
blah blah {FOO} 'Three backslashes {BAR} and an escaped quote \\\\\\\' here {BAR}'

I'm using preg_match in PHP to get the word in the braces ("FOO", in this case). Here's the regex string I have so far:

$regex = '/' .
 // Match the word in braces
 '\{(\w+)\}' .
 // Only if it is followed by an even number of single-quotes
 '(?=(?:[^\']*\'[^\']*\')*[^\']*$)' .
 // The end
 '/';

My logic is that, since the only thing I'm parsing is a legal SQL string (besides the brace-thing I added), if a set of braces is followed by an even number of non-escaped quotes, then it must be outside of quotes.

The regex I provided is 100% successful EXCEPT for taking escaped quotes into consideration. I just need to make sure there is no odd number of backslashes before a quote match, but for the life of me I can't seem to convey this in RegEx. Any takers?
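
To make the failure concrete, here is a small sketch that runs the regex above against both sample strings. The helper name `brace_words` is made up for this illustration, and the samples are written as nowdocs so that every backslash is literal.

```php
<?php
// The two sample strings from above; nowdoc syntax does no escape processing.
$sample1 = <<<'SQL'
blah blah {FOO} blah 'I\'m typing {BAR} here with an escaped backslash \\'
SQL;
$sample2 = <<<'SQL'
blah blah {FOO} 'Three backslashes {BAR} and an escaped quote \\\\\\\' here {BAR}'
SQL;

// The regex from above, unchanged.
$regex = '/' .
    '\{(\w+)\}' .
    '(?=(?:[^\']*\'[^\']*\')*[^\']*$)' .
    '/';

// Hypothetical helper: every brace word the regex accepts in $sql.
function brace_words(string $regex, string $sql): array
{
    preg_match_all($regex, $sql, $m);
    return $m[1];
}

var_dump(brace_words($regex, $sample1)); // empty: three quote characters follow {FOO}
var_dump(brace_words($regex, $sample2)); // only "BAR": the escaped quote flips the parity
```

In both samples the escaped quote is counted as a real one, so {FOO} is followed by an odd number of quote characters and is rejected, while in the second sample the first {BAR} is followed by an even number and is wrongly accepted.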

+1  A: 

The way to deal with escaped quotes and backslashes is to consume them in matched pairs.

(?=(?:(?:(?:[^\'\\]++|\\.)*+\'){2})*+(?:[^\'\\]++|\\.)*+$)

In other words, as you scan for the next quote, you skip any pair of characters that starts with a backslash. That takes care of both escaped quotes and escaped backslashes. This lookahead will allow escaped characters outside of quoted sections, which probably isn't necessary, but it probably won't hurt either.

P.S. Notice the liberal use of possessive quantifiers (*+ and ++); without them you could run into performance problems, especially if the target strings are large. Also, if the strings can contain line breaks, you may need to do the matching in DOTALL mode (also known as "singleline" or "/s" mode).
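
As a sketch of how this lookahead might be dropped into the question's PHP code: the function name `braces_outside_quotes` is hypothetical, and the main thing to watch is escaping. Inside a single-quoted PHP string, `\\` produces one backslash, so each regex backslash that must reach PCRE as a literal backslash matcher has to be doubled again.

```php
<?php
// Hypothetical wrapper: the question's brace matcher plus the lookahead above.
function braces_outside_quotes(string $sql): array
{
    // In a single-quoted PHP string, \\ is one backslash, so the regex's
    // [^'\\] and \\. are written [^\'\\\\] and \\\\. here.
    $regex = '/' .
        '\{(\w+)\}' .
        '(?=(?:(?:(?:[^\'\\\\]++|\\\\.)*+\'){2})*+(?:[^\'\\\\]++|\\\\.)*+$)' .
        '/s'; // s = DOTALL, so a backslash followed by a line break is still consumed as a pair
    preg_match_all($regex, $sql, $m);
    return $m[1];
}

// The question's samples, as nowdocs so the backslashes stay literal.
$first = <<<'SQL'
blah blah {FOO} blah 'I\'m typing {BAR} here with an escaped backslash \\'
SQL;
$second = <<<'SQL'
blah blah {FOO} 'Three backslashes {BAR} and an escaped quote \\\\\\\' here {BAR}'
SQL;

var_dump(braces_outside_quotes($first));  // just "FOO"
var_dump(braces_outside_quotes($second)); // just "FOO"
```

Pasting the lookahead into a single-quoted string without doubling the backslashes would turn `[^\'\\]` into `[^'\]`, which leaves the character class unterminated.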

However, I agree with mmyers: if you're trying to parse SQL, you will run into problems that regexes can't handle at all. Of all the things that regexes are bad at, SQL is one of the worst.

Alan Moore
Ingenious work, Alan! Thanks very much -- this is exactly what I was looking for. As I mentioned in reply to mmyers, though, I'm definitely not trying to parse the SQL language with this. All I needed was to lift out those bracketed items, with no further processing. For arbitrary length strings, preg_match is definitely the cleaner, and most likely less expensive, method.
Tom Frost
A: 

Simply, and perhaps naively, str_replace out all your double backslashes. Then str_replace out escaped single quotes. At that point it's relatively simple to find matches that are not between single quotes (using your existing regex, for example).

Unfortunately, with this solution you need to replace those instances with one or more characters that you can be positive won't appear anywhere else in your string. In this case, I can make no such assumptions.
Tom Frost
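
A minimal sketch of the str_replace idea, under one loudly stated assumption: because only the brace words are being lifted out and the stripped copy is thrown away afterwards, the escaped sequences are deleted outright rather than swapped for a placeholder. The function name `braces_after_stripping` is hypothetical. Deleting can glue together the text on either side of a removed sequence (for example `{FO\\O}` would become `{FOO}`), so this assumes backslashes do not occur inside braces outside of quotes.

```php
<?php
// Hypothetical helper: delete escaped pairs, then apply the question's original regex.
function braces_after_stripping(string $sql): array
{
    // Order matters: remove escaped backslashes first, so that any backslash
    // still sitting in front of a quote really is escaping that quote.
    $stripped = str_replace('\\\\', '', $sql);      // the two-character sequence \\
    $stripped = str_replace('\\\'', '', $stripped); // the two-character sequence \'
    $regex = '/\{(\w+)\}(?=(?:[^\']*\'[^\']*\')*[^\']*$)/';
    preg_match_all($regex, $stripped, $m);
    return $m[1];
}

$one = <<<'SQL'
blah blah {FOO} blah 'I\'m typing {BAR} here with an escaped backslash \\'
SQL;
$two = <<<'SQL'
blah blah {FOO} 'Three backslashes {BAR} and an escaped quote \\\\\\\' here {BAR}'
SQL;

var_dump(braces_after_stripping($one)); // just "FOO"
var_dump(braces_after_stripping($two)); // just "FOO"
```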
A: 

If you really want to use regular expressions for this, I would do it in two steps:

  1. Separate the strings from the non-strings with preg_split:

    // Only the outer group captures, so DELIM_CAPTURE returns whole strings and no stray fragments.
    $re = "('(?:[^\\\\']++|\\\\.)*+')";
    $parts = preg_split('/'.$re.'/', $str, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
    
  2. Unwrap any brace expressions found inside the string parts:

    foreach ($parts as $key => $val) {
        if (preg_match('/^'.$re.'$/', $val)) {
            $parts[$key] = preg_replace('/\{([^}]*)}/', '$1', $val);
        }
    }
    

But a real parser would probably be better as this approach is not that efficient.
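
For the extraction goal in the question, the same split can be used in the other direction. The following is a sketch, not part of the steps above: it throws the quoted strings away and collects the brace words from whatever is left. The function name `braces_outside_strings` is hypothetical.

```php
<?php
// Hypothetical helper: brace words that occur outside single-quoted strings.
function braces_outside_strings(string $sql): array
{
    // A single-quoted string with backslash escapes: runs of ordinary
    // characters, or a backslash followed by any one character.
    $stringRe = "'(?:[^\\\\']++|\\\\.)*+'";
    // Without PREG_SPLIT_DELIM_CAPTURE the quoted strings are dropped
    // and only the text between them is returned.
    $parts = preg_split('/' . $stringRe . '/s', $sql);
    if ($parts === false) {
        return []; // the split itself failed
    }
    $words = [];
    foreach ($parts as $part) {
        if (preg_match_all('/\{(\w+)\}/', $part, $m)) {
            $words = array_merge($words, $m[1]);
        }
    }
    return $words;
}

$a = <<<'SQL'
blah blah {FOO} blah 'I\'m typing {BAR} here with an escaped backslash \\'
SQL;
$b = <<<'SQL'
blah blah {FOO} 'Three backslashes {BAR} and an escaped quote \\\\\\\' here {BAR}'
SQL;

var_dump(braces_outside_strings($a)); // just "FOO"
var_dump(braces_outside_strings($b)); // just "FOO"
```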

Gumbo