Hi,

I'm trying to write a regex that will match everything BUT an apostrophe that has not been escaped. Consider the following:

<?php $s = 'Hi everyone, we\'re ready now.'; ?>

My goal is to write a regular expression that will essentially match the string portion of that. I'm thinking of something such as

/.*'([^']).*/

in order to match a simple string, but I've been trying to figure out how to get a negative lookbehind working on that apostrophe to ensure that it is not preceded by a backslash...

Any ideas?

- JMT

+1  A: 
/.*'([^'\\]|\\.)*'.*/

The parenthesized group matches either a character that is neither an apostrophe nor a backslash, or a backslash followed by any character. If only certain characters can be escaped, change the \\. to \\['\\a-z], or whatever set you need.
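As a rough illustration of the same alternation idea outside PHP, here is a minimal Python sketch. The function name first_string_literal and the capture of only the string body are assumptions made for the example, not part of the answer above; the backslashes are written once because a Python raw string is used.

```python
import re

# Opening quote, then any run of ordinary characters or backslash escapes,
# then the closing quote. The capture group holds the string body.
QUOTED = re.compile(r"'((?:[^'\\]|\\.)*)'")


def first_string_literal(source):
    """Return the body of the first single-quoted literal, or None."""
    match = QUOTED.search(source)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(first_string_literal(r"<?php $s = 'Hi everyone, we\'re ready now.'; ?>"))
    print(first_string_literal(r"x = 'ends with a backslash\\';"))
```

Because a backslash is always consumed together with the character after it, an escaped backslash right before the closing quote is read as one pair, and the quote that follows still ends the literal.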

John Kugelman
Very nearly, but that doesn't handle the pathological case: 'My string ends with a backslash\\'
the.jxc
Thanks John! Fortunately for me, the cases I have to deal with can be constrained, and will never hit the problem that the.jxc describes. A very simple solution that I really should have thought of. Again, thank you! : )
JMTyler
A: 

Via negative look behind:

/
.*?'              #Match until '
(
 .*?              #Lazy match & capture of everything after the first apostrophe
)    
(?<!(?<!\\)\\)'   #Match first apostrophe that isn't preceded by \, but accept \\
.*                #Match remaining text
/x
Gavin Miller
A: 

How about a nested negative look behind to handle the pathological case?

.*?'(.*?)(?<!(?<!\\)\\)'.*
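To see what the nested look-behind accepts and rejects, here is a small Python sketch of the core of that pattern (the leading and trailing .* are dropped, and the helper name extract is made up for the example). It only inspects the two characters before a candidate closing quote, so this is an illustration of the idea under that limitation, not a complete tokenizer.

```python
import re

# A closing quote is accepted unless it is preceded by a backslash that is
# itself NOT preceded by another backslash.
NESTED = re.compile(r"'(.*?)(?<!(?<!\\)\\)'")


def extract(text):
    """Return the body of the first quoted string, or None if nothing matches."""
    match = NESTED.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract(r"$s = 'we\'re ready';"))  # escaped quote is skipped
    print(extract(r"x = 'ends\\';"))         # escaped backslash, then closing quote
    print(extract(r"'a\\\'b'"))              # three backslashes: stops too early
```

The last call shows the limit of a fixed two-character look-behind: with an escaped backslash followed by an escaped quote, the quote is wrongly treated as the end of the string.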

edit: oops, thought it was JaredMT who had commented on John's solution.

Jon Freeland
+2  A: 
<?php
$backslash = '\\';

$pattern = <<< PATTERN
#(["'])(?:{$backslash}{$backslash}?+.)*?{$backslash}1#
PATTERN;

foreach(array(
    "<?php \$s = 'Hi everyone, we\\'re ready now.'; ?>",
    '<?php $s = "Hi everyone, we\\"re ready now."; ?>',
    "xyz'a\\'bc\\d'123",
    "x = 'My string ends with with a backslash\\\\';"
    ) as $subject) {
     preg_match($pattern, $subject, $matches);
     echo $subject , ' => ', $matches[0], "\n\n";
}

prints

<?php $s = 'Hi everyone, we\'re ready now.'; ?> => 'Hi everyone, we\'re ready now.'

<?php $s = "Hi everyone, we\"re ready now."; ?> => "Hi everyone, we\"re ready now."

xyz'a\'bc\d'123 => 'a\'bc\d'

x = 'My string ends with with a backslash\\'; => 'My string ends with with a backslash\\'
VolkerK
Voting up because you provided test cases.
the.jxc
A: 
Regex reg = new Regex("(?<!\\\\)'(?<string>.*?)(?<!\\\\)'");
patjbs
+2  A: 

Here's my solution with test cases:

/.*?'((?:\\\\|\\'|[^'])*+)'/

And my proof (in Perl, though I don't think I use any Perl-specific features):

use strict;
use warnings;

my %tests = ();
$tests{'Case 1'} = <<'EOF';
$var = 'My string';
EOF

$tests{'Case 2'} = <<'EOF';
$var = 'My string has it\'s challenges';
EOF

$tests{'Case 3'} = <<'EOF';
$var = 'My string ends with a backslash\\';
EOF

foreach my $key (sort (keys %tests)) {
    print "$key...\n";
    if ($tests{$key} =~ m/.*?'((?:\\\\|\\'|[^'])*+)'/) {
        print " ... '$1'\n";
    } else {
        print " ... NO MATCH\n";
    }
}

Running this shows:

$ perl a.pl
Case 1...
 ... 'My string'
Case 2...
 ... 'My string has it\'s challenges'
Case 3...
 ... 'My string ends with a backslash\\'

Note that the initial wildcard at the start needs to be non-greedy. Then I use a possessive (non-backtracking) quantifier to gobble up \\ and \', and then anything else that is not a standalone quote character.

I think this one probably mimics the compiler's built-in approach, which should make it pretty bullet-proof.
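For comparison, a hand-written scanner in the spirit of what a lexer typically does might look like the following Python sketch. This is an assumption about the general approach, not a description of any particular compiler, and the function name scan_single_quoted is invented for the example.

```python
def scan_single_quoted(source):
    """Return the body of the first single-quoted literal in source, or None."""
    start = source.find("'")
    if start == -1:
        return None
    i = start + 1
    while i < len(source):
        ch = source[i]
        if ch == "\\":
            i += 2  # skip the backslash and whatever character it escapes
        elif ch == "'":
            return source[start + 1:i]  # unescaped quote closes the literal
        else:
            i += 1
    return None  # ran off the end: unterminated literal


if __name__ == "__main__":
    for case in (
        r"$var = 'My string';",
        r"$var = 'My string has it\'s challenges';",
        r"$var = 'My string ends with a backslash\\';",
    ):
        print(scan_single_quoted(case))
```

Walking left to right and always consuming a backslash together with the next character is the same rule the regex encodes with its \\\\ and \\' alternatives.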

the.jxc