Hello,

I'm working on a template class and I've an issue when trying to parse out a list of quoted strings from a string argument list. Take for example the string:

$string = 'VAR_SELECTED, \'Hello m\'lady\', "null"';

I'm having a problem coming up with a regex that extracts the string "Hello m'lady" and "null". The closest I have got is

$string = 'VAR_SELECTED, \'Hello m\'lady\', "null", \'TE\'ST\'';
preg_match_all('/(?:[^\']|\\\\.)+|(?:[^"]|\\\\.)+/', $string, $matches);
print_r($matches);

Which outputs:

Array
(
    [0] => Array
        (
            [0] => VAR_SELECTED, 
            [1] => 'Hello m'lady', 
            [2] => "null", 
            [3] => 'TE'ST'
        )

)

However a more complex case of:

$string = 'VAR_SELECTED, \'Hello "Father"\', "Hello \'Luke\'"';
preg_match_all('/(?:[^\']|\\\\.)+|(?:[^"]|\\\\.)+/', $string, $matches);
print_r($matches);  

outputs:

Array
(
    [0] => Array
        (
            [0] => VAR_SELECTED, 
            [1] => 'Hello 
            [2] => "Father"
            [3] => ', 
            [4] => "Hello 
            [5] => 'Luke'
            [6] => "
        )

)

Can anyone help me solve this problem? Are multiple regexes the way forward?

Edit: Maybe it would be easier to replace the commas within the strings with a placeholder and then break apart the strings with an explode?

Edit 2: I just thought of a simple but insecure option (which I am not going to use); it also generates an E_NOTICE error.

$string = 'return array(VAR_SELECTED, \'Hello , "Father"\', "Hello \'Luke\'4");';
$string = eval($string);
print_r($string);
A: 

You want to use a back reference in the match string.

preg_match_all('@([\'"]).*[^\\\\]\1@', $string, $matches);

This will start matching with the first instance of " or ' and then match the longest string that ends with a matching " or ' that isn't escaped.

Array
(
    [0] => Array
        (
            [0] => 'Hello m'lady', "null", 'TE'ST'
        )

    [1] => Array
        (
            [0] => '
        )

)
Peter Rowell
hmm, the matches required though are 'Hello m'lady', 'null' and 'TE'ST' as individual strings, not one long one.
buggedcom
Oh well. I misread what the problem was. It's that darn old 1-beer handicap thing.
Peter Rowell
+1  A: 

Here's how I would do it:

Break the task down into the component steps you want to take:

1.) Explode the string on commas.

For 'VAR_SELECTED, \'Hello m\'lady\', "null"' this gives me
[0]=>"VAR_SELECTED"
[1]=>" \'Hello m\'lady\'"
[2]=>" "null""

For 'VAR_SELECTED, \'Hello "Father"\', "Hello \'Luke\'"' this gives me
[0]=>"VAR_SELECTED"
[1]=>" \'Hello "Father"\'"
[2]=>" "Hello \'Luke\'""

2.) Run trim() on each element to get rid of any surrounding whitespace

For 'VAR_SELECTED, \'Hello m\'lady\', "null"' this gives me
[0]=>"VAR_SELECTED"
[1]=>"\'Hello m\'lady\'"
[2]=>""null""

For 'VAR_SELECTED, \'Hello "Father"\', "Hello \'Luke\'"' this gives me
[0]=>"VAR_SELECTED"
[1]=>"\'Hello "Father"\'"
[2]=>""Hello \'Luke\'""

3.) Run str_replace('\\', '', $text) to get rid of the backslashes (that is, replace a bare backslash with an empty string).

For 'VAR_SELECTED, \'Hello m\'lady\', "null"' this gives me
[0]=>"VAR_SELECTED"
[1]=>"'Hello m'lady'"
[2]=>""null""

For 'VAR_SELECTED, \'Hello "Father"\', "Hello \'Luke\'"' this gives me
[0]=>"VAR_SELECTED"
[1]=>"'Hello "Father"'"
[2]=>""Hello 'Luke'""

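4.) Strip the surrounding quotes. trim($text, '\'"') is the obvious call, but it removes every listed character from both ends, so a value such as 'Hello "Father"' would also lose its trailing inner quote; removing just the one matching outer pair avoids that.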

For 'VAR_SELECTED, \'Hello m\'lady\', "null"' this gives me
[0]=>"VAR_SELECTED"
[1]=>"Hello m'lady"
[2]=>"null"

For 'VAR_SELECTED, \'Hello "Father"\', "Hello \'Luke\'"' this gives me
[0]=>"VAR_SELECTED"
[1]=>"Hello "Father""
[2]=>"Hello 'Luke'"
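
The four steps above could be sketched in PHP roughly as follows. The function names parse_args_naive and strip_outer_quotes are made up for this illustration. Step 4 here removes a single matching outer quote pair instead of calling trim() with a character list, on the assumption that a nested quote at the end of a value should be kept. Like the steps themselves, this only works while the values contain no commas.

```php
<?php
// Hypothetical helper: remove one matching pair of outer quotes, if present.
function strip_outer_quotes($s)
{
    $n = strlen($s);
    if ($n >= 2 && ($s[0] === "'" || $s[0] === '"') && $s[$n - 1] === $s[0]) {
        return (string) substr($s, 1, -1);
    }
    return $s;
}

// Hypothetical helper implementing steps 1-4.
function parse_args_naive($input)
{
    $parts = explode(',', $input);                   // 1. split on commas
    $parts = array_map('trim', $parts);              // 2. strip surrounding whitespace
    $parts = str_replace('\\', '', $parts);          // 3. drop any literal backslashes
    return array_map('strip_outer_quotes', $parts);  // 4. strip the outer quotes
}

print_r(parse_args_naive('VAR_SELECTED, \'Hello "Father"\', "Hello \'Luke\'"'));
```

Keeping the quote stripping in its own function makes it easy to swap in a different rule later, for example one that also unescapes doubled quotes.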

I haven't tested this, but the logic is sound. A quick and dirty way to test 98% of all regexes (in my experience) is to use http://rubular.com/. It's a great site. Usually, if it starts to choke on a regex, that's my first sign that I should break the problem down more. (That's just opinion ~dons flameproof suit~)

Caladain
That would work if the strings do not contain commas themselves; otherwise you would get broken strings as well.
buggedcom
Caladain
Surely the quotes do that? You mean an uncommon string like # or something?
buggedcom
Not really. In your samples you show a case where the string doesn't close all of its quotes: \'Hello m\'lady\'. So if I were breaking on matched quotes, that string wouldn't work. And yes, an uncommon string is a pretty standard choice for a delimiter (comma, tilde, a pattern like 00x0 that would never come up, etc.). This is a non-trivial problem :-) You have to have some pattern to "break" the string into workable fields. It's why you can't start a PHP string with ' and end it with ": the parser is "matching" single and double quotes.
Caladain
+1  A: 

Try this:

/(?<=^|[\s,])(?:(['"]).*?\1|[^\s,'"]+)(?=[\s,]|$)/

Or, as a PHP single-quoted string literal:

'/(?<=^|[\s,])(?:([\'"]).*?\1|[^\s,\'"]+)(?=[\s,]|$)/'
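
A quick way to try the pattern is sketched below. The input is the mixed-quote string from the question, and the variable names are arbitrary.

```php
<?php
// The pattern from above as a single-quoted PHP literal.
$pattern = '/(?<=^|[\s,])(?:([\'"]).*?\1|[^\s,\'"]+)(?=[\s,]|$)/';

$string = 'VAR_SELECTED, \'Hello "Father"\', "Hello \'Luke\'"';

// $matches[0] holds the full matches; $matches[1] holds the opening quote
// captured by group 1 (an empty string for unquoted tokens).
preg_match_all($pattern, $string, $matches);
print_r($matches[0]);
```

The surrounding quotes are still part of each match, so they would need to be stripped in a later step.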

That regex yields the desired result, but I think you're going about this wrong. Usually, if a quoted string needs to contain a literal quote character, the quote is escaped, either with a backslash or with another quote. You aren't doing that, so I had to use a fragile hack based on lookarounds. Are you sure the data isn't supposed to look like this?

$string = 'VAR_SELECTED, \'Hello m\\\'lady\', "null"';

$string = 'VAR_SELECTED, \'Hello "Father"\', "Hello \\\'Luke\\\'"';

Come to think of it, doesn't PHP have built-in support for CSV data?
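
For what it's worth, PHP's str_getcsv (like fgetcsv) accepts a single enclosure character, so it copes with uniformly quoted input but not with a mix of ' and ". The inputs below are made-up examples, not the asker's real data.

```php
<?php
// Uniform double-quote enclosures: the comma inside the quotes is preserved.
$ok = str_getcsv('VAR_SELECTED,"Hello, world","null"', ',', '"', '\\');
print_r($ok);

// Mixed enclosures: the single-quoted field is not recognised as quoted,
// so it is split at its inner comma.
$mixed = str_getcsv('VAR_SELECTED,\'Hello, world\',"null"', ',', '"', '\\');
print_r($mixed);
```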

Alan Moore
The problem is that he says commas can appear in the strings themselves, along with unescaped quotes and mixtures of quotes. I'm almost thinking he needs to crawl the string to find the unmatched "start" characters. But that's awfully C++-ish for PHP.
Caladain
@Alan - Thanks, I think your regex has it. PHP does indeed have a CSV parser, and a string version of it, str_getcsv (PHP >= 5.3); however, PHP still fails to parse this data correctly, because the enclosure can be either a " or a ' within the same argument list. Silly, I know, but template designers are silly. @Caladain - I think this actually solves it. Try this string with a preg_match: $string = 'VAR_SELECTED, \'Hello , "Father"\', "Hell,o \'Luke\'", \',"\'';
buggedcom
Consider the string: $string = 'VAR_SELECTED, \'Hello, \' "Fa\'ther" \', "Hello, \'Luke, "my Son"\'"'; It doesn't break right. Alan's intuition is correct here, I think. Lookarounds and backtracking can be very fragile. Having uniformly formatted and escaped data makes this a much simpler problem; otherwise you can never guarantee you won't be fed a malformed string (sometimes on purpose, to inject code; sometimes because users are monkeys pounding on the keyboard who don't care about properly escaping stuff).
Caladain
@Caladain - the string you gave does not break right, but it is still separated into component strings. Even when you consider $string = 'VAR_SELECTED, \'Hello, \' "Fa\'ther" \', "Hello, \'Luke, test';, because each individual match is post-processed, I can safely say that no scripting can get through. So for now, until proven otherwise, this will do.
buggedcom
I was hoping you were in a position to change the design, but I guess not.
Alan Moore