I need to split a string on commas and spaces, but ignore any commas and spaces that appear inside double quotes, single quotes, or parentheses:

    $str = "Questions, \"Quote\",'single quote','comma,inside' (inside parentheses) space #specialchar";

so that the resulting array contains:

[0]Questions
[1]Quote
[2]single quote
[3]comma,inside
[4]inside parentheses
[5]space
[6]#specialchar

My current regexp is:

    $tags = preg_split("/[,\s]*[^\w\s]+[\s]*/", $str, 0, PREG_SPLIT_NO_EMPTY);

but this drops the special characters and still splits on the commas inside quotes. The resulting array is:

[0]Questions
[1]Quote
[2]single quote
[3]comma
[4]inside
[5]inside parentheses
[6]space
[7]specialchar

PS: this is not CSV.

Many thanks.

+3  A: 

This will work only for non-nested parentheses:

    $regex = <<<HERE
    /  "  ( (?:[^"\\\\]++|\\\\.)*+ ) \"
     | '  ( (?:[^'\\\\]++|\\\\.)*+ ) \'
     | \( ( [^)]*                  ) \)
     | [\s,]+
    /x
    HERE;

    $tags = preg_split($regex, $str, -1,
                         PREG_SPLIT_NO_EMPTY
                       | PREG_SPLIT_DELIM_CAPTURE);

The possessive quantifiers `++` and `*+` consume as much as they can and give nothing back when the engine backtracks. This technique is described in perlre(1) as the most efficient way to do this kind of matching.
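To make the "give nothing back" behaviour concrete, here is a minimal sketch comparing a greedy and a possessive quantifier. The toy pattern and the variable names `$greedy` and `$possessive` are illustrative and not part of the answer's regex.

```php
<?php
// Greedy: a+ first takes all three a's, then gives one back
// so that the trailing literal "a" can still match.
$greedy = preg_match('/^a+a$/', 'aaa');

// Possessive: a++ takes all three a's and never gives any back,
// so nothing is left for the trailing literal "a".
$possessive = preg_match('/^a++a$/', 'aaa');

var_dump($greedy);     // int(1)
var_dump($possessive); // int(0)
```

In the answer's pattern the same property means that once the body of a quoted string has been consumed, the engine does not retry shorter bodies if the closing quote is missing.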

Inshallah
Thank you, this works well.
mozlima
Inshallah, do you know the equivalent regexp for JavaScript's split() function? It would be nice if you could tell me.
weng
@unknown, I think the `/x` flag and the `*+` and `++` quantifiers may not be supported there, so drop the `/x` flag, strip all whitespace (including newlines) from the pattern, and use plain `*` and `+` in place of `*+` and `++`.
Inshallah
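As a rough sketch of the advice in the comment above, this is the first answer's pattern written on one line, without the `/x` flag and with plain `*` and `+`. It is still PHP, so it can be checked against the question's input; the names `$portable` and `$tags` are illustrative, and the JavaScript side is not tested here.

```php
<?php
$str = "Questions, \"Quote\",'single quote','comma,inside' (inside parentheses) space #specialchar";

// Nowdoc keeps every backslash literal, so the pattern below is
// exactly what the regex engine sees: no /x, no possessive quantifiers.
$portable = <<<'RE'
/"((?:[^"\\]+|\\.)*)"|'((?:[^'\\]+|\\.)*)'|\(([^)]*)\)|[\s,]+/
RE;

$tags = preg_split($portable, $str, -1,
                   PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

print_r($tags);
// Questions, Quote, single quote, "comma,inside",
// inside parentheses, space, #specialchar (7 elements)
```

Two caveats, stated as assumptions rather than tested facts: without possessive quantifiers the engine may backtrack far more on an unterminated quote, and JavaScript's split() has no counterpart to PREG_SPLIT_NO_EMPTY, so empty or undefined entries would probably need to be filtered out afterwards.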
+1  A: 

Well, this works for the data you supplied:

    $rgx = <<<'EOT'
    /
      [,\s]++
      (?=(?:(?:[^"]*+"){2})*+[^"]*+$)
      (?=(?:(?:[^']*+'){2})*+[^']*+$)
      (?=(?:[^()]*+\([^()]*+\))*+[^()]*+$)
    /x
    EOT;

The lookaheads assert that if there are any double-quotes, single-quotes or parentheses ahead of the current match position there's an even number of them, and the parens are in balanced pairs (no nesting allowed). That's a quick-and-dirty way to ensure that the current match isn't occurring inside a pair of quotes or parens.
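A minimal sketch of how this pattern might be used on the question's input. Because the lookaheads only decide where to split, the quotes and parentheses stay attached to each piece; the final `trim()` step and the names `$parts` and `$tags` are assumptions added here, not part of the answer.

```php
<?php
$str = "Questions, \"Quote\",'single quote','comma,inside' (inside parentheses) space #specialchar";

$rgx = <<<'EOT'
/
  [,\s]++
  (?=(?:(?:[^"]*+"){2})*+[^"]*+$)
  (?=(?:(?:[^']*+'){2})*+[^']*+$)
  (?=(?:[^()]*+\([^()]*+\))*+[^()]*+$)
/x
EOT;

// Split only on separators that are outside quotes and parentheses.
$parts = preg_split($rgx, $str, -1, PREG_SPLIT_NO_EMPTY);
// $parts[1] is '"Quote"': the delimiters are still part of each piece.

// Assumed clean-up step: strip the surrounding quotes and parentheses.
$tags = array_map(function ($p) {
    return trim($p, "\"'()");
}, $parts);

print_r($tags);
// Questions, Quote, single quote, "comma,inside",
// inside parentheses, space, #specialchar (7 elements)
```

Note that `trim()` would also strip a quote or parenthesis that legitimately begins or ends an unquoted tag, so this clean-up only suits input like the sample.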

Of course, it assumes the input is well formed. But on the subject of well-formedness, what about escaped quotes within quotes? What if you have quotes inside parens, or vice versa? Would this input be legal?

    "not a \" quote", 'not a ) quote', (not ",' quotes)

If so, you've got a much more difficult job ahead of you.
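To illustrate the caveat, this sketch feeds that input to the lookahead pattern from this answer. The input contains three single quotes, an odd count, so the single-quote lookahead rejects the very first separator and the three intended fields are not recovered. The names `$input`, `$wanted`, and `$parts` are illustrative.

```php
<?php
$rgx = <<<'EOT'
/
  [,\s]++
  (?=(?:(?:[^"]*+"){2})*+[^"]*+$)
  (?=(?:(?:[^']*+'){2})*+[^']*+$)
  (?=(?:[^()]*+\([^()]*+\))*+[^()]*+$)
/x
EOT;

// Nowdoc, so the backslash and both kinds of quotes stay literal.
$input = <<<'TXT'
"not a \" quote", 'not a ) quote', (not ",' quotes)
TXT;

// The three fields a human reader would expect.
$wanted = ['"not a \" quote"', "'not a ) quote'", '(not ",\' quotes)'];

$parts = preg_split($rgx, $input, -1, PREG_SPLIT_NO_EMPTY);

var_dump($parts === $wanted); // bool(false)
```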

Alan Moore