Two days ago I started working on a code parser and I'm stuck.

How can I split a string on commas that are not inside brackets? Let me show you what I mean:

I have this string to parse:

one, two, three, (four, (five, six), (ten)), seven

I would like to get this result:

array(
 "one"; 
 "two"; 
 "three"; 
 "(four, (five, six), (ten))"; 
 "seven"
)

but instead I get:

array(
  "one"; 
  "two"; 
  "three"; 
  "(four"; 
  "(five"; 
  "six)"; 
  "(ten))";
  "seven"
)

How can I do this with a PHP regex?

Thank you in advance!

+1  A: 

I'm afraid it would be very difficult to parse nested brackets such as one, two, (three, (four, five)) with RegExp alone.

MyKey_
+5  A: 

You can't, directly. You'd need, at minimum, variable-width lookbehind, and last I knew PHP's PCRE only has fixed-width lookbehind.

My first recommendation would be to first extract parenthesized expressions from the string. I don't know anything about your actual problem, though, so I don't know if that will be feasible.

chaos
Yes, that was the hack I was planning to use: replace the brackets with $1, $2 or something similar, split the string, and then restore the brackets in the result. Thank you!
Christian Toma
The point is that what you describe is not a regular language, so regular expressions are an ill fit. So, parsing out all the nested parts first is not a "hack" but the most sensible thing to do.
Svante
+2  A: 

I can't think of a way to do it using a single regex, but it's quite easy to hack together something that works:

function process($data)
{
        $entries = array();
        $filteredData = $data;
        if (preg_match_all("/\(([^)]*)\)/", $data, $matches)) {
                $entries = $matches[0];
                $filteredData = preg_replace("/\(([^)]*)\)/", "-placeholder-", $data);
        }

        $arr = array_map("trim", explode(",", $filteredData));

        if (!$entries) {
                return $arr;
        }

        $j = 0;
        foreach ($arr as $i => $entry) {
                if ($entry != "-placeholder-") {
                        continue;
                }

                $arr[$i] = $entries[$j];
                $j++;
        }

        return $arr;
}

If you invoke it like this:

$data = "one, two, three, (four, five, six), seven, (eight, nine)";
print_r(process($data));

It outputs:

Array
(
    [0] => one
    [1] => two
    [2] => three
    [3] => (four, five, six)
    [4] => seven
    [5] => (eight, nine)
)
Emil H
Thank you very much, this should work. This was how I planned to do it first, but I thought that an easier way exists.
Christian Toma
Your method cannot parse "one, two, three, ((five), (four(six))), seven, eight, nine". I think the correct RegEx would be a recursive one: /\(([^()]+|(?R))*\)/.
Christian Toma
You didn't mention that it had to be able to parse recursive expressions back when I first wrote this answer, though. Still, others have definitely suggested better solutions since I wrote this.
Emil H
Yes, I'm sorry. My bad.
Christian Toma
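For anyone landing here later: the placeholder approach from this answer can be combined with the recursive pattern from the comment above so that nested brackets survive. What follows is only a sketch, not the answerer's code. The function name split_with_placeholders is made up for illustration, the token built from "\x01" assumes that byte never occurs in the input, and the typed closures assume PHP 7 or later. The inner + is written as the possessive ++ to limit backtracking on unbalanced input.

```php
<?php
// Hypothetical helper (not from the thread): mask every top-level bracket
// group with a numbered token, split on commas, then restore the groups.
function split_with_placeholders(string $input): array
{
    $groups = [];

    // (?R) recurses into the whole pattern, so a nested group is matched
    // as one unit starting at its outermost opening bracket.
    $masked = preg_replace_callback(
        '/\((?:[^()]++|(?R))*\)/',
        function (array $m) use (&$groups): string {
            $groups[] = $m[0];
            // Assumption: the byte \x01 never appears in real input.
            return "\x01" . (count($groups) - 1) . "\x01";
        },
        $input
    );

    // No brackets are left at this point, so a plain explode is safe.
    $parts = array_map('trim', explode(',', $masked));

    // Swap each token back for the bracket group it stands for.
    foreach ($parts as $i => $part) {
        $parts[$i] = preg_replace_callback(
            '/\x01(\d+)\x01/',
            function (array $m) use ($groups): string {
                return $groups[(int) $m[1]];
            },
            $part
        );
    }

    return $parts;
}

print_r(split_with_placeholders(
    "one, two, three, ((five), (four(six))), seven, eight, nine"
));
```

Because the token is replaced wherever it occurs inside an item, an entry such as foo(bar, baz) is restored intact too, not only entries that consist of a bracket group alone.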
+2  A: 

Clumsy, but it does the job...

<?php

function split_by_commas($string) {
  preg_match_all("/\(.+?\)/", $string, $result); 
  $problem_children = $result[0];
  $i = 0;
  $temp = array();
  foreach ($problem_children as $submatch) { 
    $marker = '__'.$i++.'__';
    $temp[$marker] = $submatch;
    $string   = str_replace($submatch, $marker, $string);  
  }
  $result = explode(",", $string);
  foreach ($result as $key => $item) {
    $item = trim($item);
    $result[$key] = isset($temp[$item])?$temp[$item]:$item;
  }
  return $result;
}


$test = "one, two, three, (four, five, six), seven, (eight, nine), ten";

print_r(split_by_commas($test));

?>
Dycey
+5  A: 

Hm... OK already marked as answered, but since you asked for an easy solution I will try nevertheless:

<?php
  $test = "one, two, three, , , ,(four, five, six), seven, (eight, nine)";
  $split = "/([(].*?[)])|(\w)+/";
  preg_match_all($split, $test, $out);
  print_r($out[0]);              
  die();
?>

Output

Array
(
    [0] => one
    [1] => two
    [2] => three
    [3] => (four, five, six)
    [4] => seven
    [5] => (eight, nine)
)
merkuro
Thank you very much, your help is much appreciated. But now I realize that I will also encounter nested brackets and your solution doesn't apply.
Christian Toma
+4  A: 

You can do that more easily:

preg_match_all('/[^(,\s]+|\([^)]+\)/', $str, $matches);

But it would be better if you use a real parser. Maybe something like this:

$str = 'one, two, three, (four, (five, six), (ten)), seven';
$buffer = '';
$stack = array();
$depth = 0;
$len = strlen($str);
for ($i=0; $i<$len; $i++) {
    $char = $str[$i];
    switch ($char) {
    case '(':
        $depth++;
        break;
    case ',':
        if (!$depth) {
            if ($buffer !== '') {
                $stack[] = $buffer;
                $buffer = '';
            }
            continue 2;
        }
        break;
    case ' ':
        if (!$depth) {
            continue 2;
        }
        break;
    case ')':
        if ($depth) {
            $depth--;
        } else {
            $stack[] = $buffer.$char;
            $buffer = '';
            continue 2;
        }
        break;
    }
    $buffer .= $char;
}
if ($buffer !== '') {
    $stack[] = $buffer;
}
var_dump($stack);
Gumbo
Yes, it's easier, but it doesn't work with nested brackets, for example: one, two, three, (four, (five, six), (ten)), seven
Christian Toma
That’s the point where you have to use a real parser. Regular expressions cannot count or handle states.
Gumbo
I have to use regular expressions. Regular expressions are recursive and greedy, you can accomplish this using them.
Christian Toma
No, you can’t. Sure, some modern implementations have features that can accomplish that, such as .NET’s *balancing groups* `(?<name1-name2> … )` http://msdn.microsoft.com/bs2twtah.aspx. But they use a state machine, and that’s no longer a regular expression in the classical sense.
Gumbo
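A side note on the exchange above: PCRE, the engine behind PHP's preg_* functions, supports recursive subpattern calls such as (?1), which go beyond classical regular expressions. Assuming that feature is acceptable, the preg_match_all one-liner from this answer can be stretched to cover nested brackets. This is a sketch only. The name match_top_level is made up for illustration, and it assumes the brackets are balanced.

```php
<?php
// Hypothetical helper (not from the thread): match each top-level item
// directly instead of splitting. An item is a run of plain characters
// and/or balanced bracket groups.
function match_top_level(string $input): array
{
    // Group 1 matches one balanced (...) group; (?1) re-enters group 1
    // for every nested opening bracket.
    $pattern = '/(?:[^,()]++|(\((?:[^()]++|(?1))*\)))+/';

    preg_match_all($pattern, $input, $matches);

    // Each match still carries the whitespace that followed the comma.
    return array_map('trim', $matches[0]);
}

print_r(match_top_level('one, two, three, (four, (five, six), (ten)), seven'));
```

Unlike the hand-written loop, this does not drop spaces inside an item, but an entry that contains only whitespace comes back as an empty string and would need filtering.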
A: 

I feel that it's worth noting that you should avoid regular expressions whenever you possibly can. To that end, you should know that on PHP 5.3+ you could use *str_getcsv()*. However, if you're working with files (or file streams), such as CSV files, then the function fgetcsv() might be what you need, and it's been available since PHP 4.

Lastly, I'm surprised nobody used preg_split(), or did it not work as needed?

ken
Yes ken, I want to use preg_split(), but what would be the RegEx that ignores commas in brackets ?
Christian Toma
Ah yes, good point. After trying for a minute or two I can see that it's challenging with the conditions set forth.
ken
Yeah, you are right. I also tried your solution and it doesn't work. Thank you all the same.
Christian Toma
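On the preg_split() question above, one pattern that appears to do the job relies on two PCRE features not mentioned in this thread: a recursive subpattern call, (?1), and the backtracking verbs (*SKIP)(*F). The idea is to match a whole balanced bracket group first and then deliberately fail, so the engine jumps past it. Only commas outside brackets are left to split on. Treat this as a sketch. The name split_top_level is made up for illustration, and it assumes a PCRE build with verb support, which current PHP versions bundle.

```php
<?php
// Hypothetical helper (not from the thread): split on commas that are
// outside balanced parentheses.
function split_top_level(string $input): array
{
    // Alternative 1: group 1 consumes a balanced (...) group, recursing
    //   through (?1) for nested brackets; (*SKIP)(*F) then rejects the
    //   match and resumes scanning after the group.
    // Alternative 2: a comma with optional surrounding whitespace, which
    //   is the real delimiter.
    $pattern = '/(\((?:[^()]++|(?1))*\))(*SKIP)(*F)|\s*,\s*/';

    $parts = preg_split($pattern, trim($input));

    // preg_split() returns false on failure; fall back to an empty list.
    return $parts === false ? [] : $parts;
}

print_r(split_top_level('one, two, three, (four, (five, six), (ten)), seven'));
```

If the brackets are unbalanced, the first alternative never matches the stray opening bracket, so the commas after it are treated as ordinary delimiters.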