tags:

views:

231

answers:

6

I'm looking at a string and trying to get everything inside the pair of brackets. The contents may change and the max and min may not exist in certain cirumstances.

get(max(fieldname1),min(fieldname2),fieldname3)where(something=something) sort(fieldname2 asc)

The where() and sort() are not guaranteed to be there.
There may be spaces between each set and [EDIT] the keywords may not always be the same.

get(something) where(something)
get(something)where(something) sort(something)

What regex pattern should be used? Effectively, it should return:

Array (
[0] => max(fieldname1),min(fieldname2),fieldname3
[1] => something=something
[2] => fieldname2 asc
)

I realise that changing the first set of brackets to { or [ may solve the problem but I'm stubborn and want to do it this way by regex.

EDIT The best I could come up with using preg_match_all()

/[a-zA-Z0-9_]+\((.*?)\)/
+1  A: 

Since you clarified that those are optional, I don't believe this will be possible to do with a regular expression. You could make it possible by keeping the different clauses (get,where,sort) in their own strings, but I don't think you'll be able to do it as-is.

Edit again: It's conceptually somewhat similar to this question from yesterday, which was shown to be impossible to do with a regex: Regex for checking if a string has mismatched parentheses?

Chad Birch
edited - forgot to mention that where and sort may not be there at all
E3
Damn that's a shame. I'll probably just change the outer brackets to { or [. On another note- how would a parser that examines each individual character perform compared to regex?
E3
Kent Fredric
+4  A: 

You better use a parser such as:

$str = 'get(max(fieldname1),min(fieldname2),fieldname3)where(something=something) sort(fieldname2 asc)';
$array = array();
$buffer = '';
$depth = 0;
for ($i=0; $i<strlen($str); $i++) {
    $buffer .= $str[$i];
    switch ($str[$i]) {
        case '(':
            $depth++;
            break;
        case ')':
            $depth--;
            if ($depth === 0) {
                $array[] = $buffer;
                $buffer = '';
            }
            break;
    }
}
var_dump($array);
Gumbo
Exactly. Regular Expressions can't parse contextual input.
troelskn
A: 

What about?

^\s*get\((.*?)\)(?:\s*where\((.*?)\))(?:\s*sort\((.*?)\)\s*)?$

now I'm not convinced this will work. For example the first match (for get) may overrun into the where and sort clauses. You might be able to deal with this using lookaheads, for example:

^\s*get\(((?:.(?!sort|where))*?)\)(?:\s*where\(((?:.(?!sort))*?)\))(?:\s*sort\((.*?)\)\s*)?$

but really this is a pretty gnarly regular expression and Gumbo is right in that a parser is arguably the better way to go. This is the case for anything where you have matched elements. HTML/XML are a classic case for where regex is used inappropriately and often. It's worse in those cases because parsers are freely available and mature.

There are lots of cases to deal with in something like this:

  • Optionality of parts of the expression;
  • False signals from literals eg get(") sort") will break the above;
  • Escaping of characters;
  • Nesting.

Chad points out the matching pair problem that I was talking about and it's worth reitering. Say you have the following HTML:

<div>
  <div></div>
</div>

Getting the matched pairs of tags is impossible with a regex (yet people keep trying or just not accounting for type of input). What makes your case possibly workable is that you have some known markers you can use:

  • The keywords get, where and sort; and
  • The start and end of the string.

But honestly regex isn't the recommended approach.

So if you want something that is robust and reliable, write a parser. Regex for this kind of thing is nothing more than a quick and dirty solution.

cletus
A: 

I support what has been said about a regexp not being appropriate for a generic structure like this. However, provided that the parentheses are balanced and not more than two deep, these regexes may help:

(\w+\s*\([^()]*(?:(?:\([^()]*\))[^()]*)*\)\s*)

matches and captures a single xyz(....) instance, while

(\w+\s*\([^()]*(?:(?:\([^()]*\))[^()]*)*\)\s*)+

matches all of them. Depending on your language, you may be able to use the second one and disentangle the several captures in the single group. This reference may be helpful.

But, to repeat, I don't think regex is the way - that's why this fairly restrictive solution is so butt-ugly.


Sorry, just noted that you're PHP. You'll probably need to use this:

(\w+\s*\([^()]*(?:(?:\([^()]*\))[^()]*)*\)\s*)(.*)

to divide up your line into (one-piece) plus (the-rest) and loop around until there's nothing left.

Brent.Longborough
A: 

This is very hackish way of doing it, could probably be done better, but just as proof of concept:

get\((max\(.+?\)),(min\(.+?\)),(.+?)\)(where\((.+?=.+?)\)| where\((.+?=.+?)\)|)(sort\((.+?)\)| sort\((.+?)\)|)

Data location will change in match array depending whether or not information is found. You can test it out there!

Maiku Mori
A: 

I sat down for a while and wrote a fully fledge FSM parser, just for the sake of interest.

It has a few features that you're not likely to see with regex ( at least under PHP, I could do it with recursive regex in Perl, but not PHP, it doesn't have that feature yet ).

  1. Smart and stack-based bracket parsing
  2. AnyBracket Support
  3. Modular
  4. Extensible.
  5. When the syntax is wrong, it can tell you where somewhat.

Granted there's a butload of code here, and a lot of it is a bit bizzare and convoluted to new coders, but in terms of what it is, its pretty awesome stuff.

Its not a finished product, just someting I threw together, but it works and doesn't have any bugs that I can find.

I've got 'die' in lots of places where It would normally be better to use Exceptions and whatnot, so a cleanup and refactor would be preferable prior to just rolling it out.

It has a reasonable amount of commenting, but I feel if I were to comment further the nitty-gritty that is finite-state-machining would be harder to understand.



# Pretty Colour Debug of the tokeniser in action. 
# Uncomment to use. 
function debug( $title, $stream, $msg, $remaining ){ 
#  print chr(27) ."[31m$title" . chr(27) ."[0m\n";
# print chr(27) ."[33min:$stream" . chr(27) ."[0m\n";
#  print chr(27) ."[32m$msg" . chr(27) ."[0m\n";
#  print chr(27) ."[34mstream:$remaining" . chr(27) ."[0m\n\n";
}

# Simple utility to store a captured part of the stream in one place
# and the remainder somewhere else
# Wraps most the regexy stuff 
# Insprired by some Perl Regex Parser I found. 

function get_token( $regex, $input ){ 
  $out = array( 
      'success' => false,
      'match' => '',
      'rest' => ''
  );
  if( !preg_match( '/^' . $regex . '/' , $input, $matches ) ){
    die("Could not match $regex at start of $input ");
    #return $out; # error condition, not matched. 
  }
  $out['match'] = $matches[1];
  $out['rest'] = substr( $input, strlen( $out['match'] ) );
  $out['success'] = true;
  debug( 'Scan For Token: '. $regex , $input, "matched: " . $out['match'] , $out['rest'] );
  return $out;
}


function skip_space( $input ){ 
  return get_token('(\s*)', $input ); 
}

# Given $input and $opener, find 
# the data stream that occurs until the respecive closer. 
# All nested bracket sets must be well balanced. 
# No 'escape code' implementation has been done (yet) 
# Match will contain the contents, 
# Rest will contain unprocessed part of the string
# []{}() and  bracket types are currently supported. 

function close_bracket( $input , $opener ){
  $out = array( 
      'success' => false,
      'match' => '',
      'rest' => ''
  );

  $map = array( '(' => ')', '[' => ']', '{' => '}', chr(60) => '>' );
  $nests = array( $map[$opener] ); 

  while( strlen($input) > 0 ){ 
    $d = get_token( '([^()\[\]{}' . chr(60). '>]*?[()\[\]{}' . chr(60)  . '>])', $input ); 
    $input = $d['rest']; 

    if( !$d['success'] ){  
      debug( 'Scan For ) Bailing ' , $input, "depth: $nests, matched: " . $out['match'] , $out['rest'] );

      $out['match'] .= $d['match'];
      return $out; # error condition, not matched. brackets are imbalanced. 
    }

# Work out which of the 4 bracket types we got, and
# Which orientation it is, and then decide if were going up the tree or down it

    end($nests);
    $tail = substr( $d['match'], -1, 1 );
    if( $tail == current($nests) ){ 
      array_pop( $nests );
    } elseif ( array_key_exists( $tail, $map ) ){ 
      array_push( $nests, $map[$tail] ); 
    } else {
      die ("Error. Bad bracket Matching, unclosed/unbalanced/unmatching bracket sequence: " . $out['match'] . $d['match'] );
    }
    $out['match'] .= $d['match'] ; 
    $out['rest' ]  = $d['rest'];
    debug( 'Scan For ) running' , $input, "depth: $nests, matched: " . $out['match'] , $out['rest'] );

    if ( count($nests) == 0 ){ 
      # Chomp off the tail bracket to just get the body
      $out['match'] = substr( $out['match'] , 0 , -1 );
      $out['success'] = true;
      debug( 'Scan For ) returning ' , $input, "matched: " . $out['match'] , $out['rest'] );
      return $out;
    }
    else { 

    }
  }
  die('Scan for closing ) exhausted buffer while searching. Brackets Missmatched. Fix this: \'' . $out['match'] . '\'');
}

# Given $function_name and $input, expects the form fnname(data) 
# 'data' can be any well balanced bracket sequence 
# also, brackets used for functions in the stream can be any of your choice, 
# as long as you're consistent. fnname[foo] will work. 

function parse_function_body( $input, $function_name ){ 
  $out = array ( 
    'success' => false, 
    'match' => '', 
    'rest' => '', 
  );

  debug( 'Parsing  ' . $function_name . "()", $input, "" , "" );

  $d = get_token( "(" . $function_name . '[({\[' . chr(60) . '])' , $input ); 

  if ( !$d['success'] ){ 
     die("Doom while parsing for function $function_name. Not Where its expected.");
  }

  $e = close_bracket( $d['rest'] , substr($d['match'],-1,1) );

  if ( !$e['success'] ){
    die("Found Imbalanced Brackets while parsing for $function_name, last snapshot was '" . $e['match'] . "'");
    return $out; # inbalanced brackets for function
  }
  $out['success'] = true;
  $out['match'] = $e['match']; 
  $out['rest'] = $e['rest'];
  debug( 'Finished Parsing  ' . $function_name . "()", $input, 'body:'. $out['match'] , $out['rest'] );

  return $out;
}

function  parse_query( $input ){ 

  $eat  = skip_space( $input ); 
  $get = parse_function_body( $eat['rest'] , 'get' ); 
  if ( !$get['success'] ){ 
    die("Get Token Malformed/Missing, instead found '" . $eat['rest'] . "'"); 
  }
  $eat = skip_space( $get['rest'] ); 
  $where = parse_function_body( $eat['rest'], 'where' ); 
  if ( !$where['success'] ){ 
    die("Where Token Malformed/Missing, instead found '" . $eat['rest'] . "'"); 
  }
  $eat = skip_space( $where['rest'] ); 
  $sort = parse_function_body( $eat['rest'], 'sort' ); 
  if( !$sort['success'] ){
    die("Sort Token Malformed/Missing, instead found '" . $eat['rest'] . "'"); 
  }
  return array( 
      'get' => $get['match'],
      'where' => $where['match'], 
      'sort' => $sort['match'], 
      '_Trailing_Data' =>  $sort['rest'],
  );
}



$structure = parse_query("get[max(fieldname1),min(fieldname2),fieldname3]where(something=something) sort(fieldname2 asc)");

print_r($structure);

$structure = parse_query("get(max(fieldname1),min(fieldname2),fieldname3)where(something=something) sort(fieldname2 asc)");

print_r($structure);

$structure = parse_query("get{max(fieldname1),min(fieldname2),fieldname3}where(something=something) sort(fieldname2 asc)");

print_r($structure);

$structure = parse_query("get" . chr(60) . "max(fieldname1),min(fieldname2),fieldname3" . chr(60). "where(something=something) sort(fieldname2 asc)");

print_r($structure);
All of the above print_r($structure) lines should produce this:
Array
(
    [get] => max(fieldname1),min(fieldname2),fieldname3
    [where] => something=something
    [sort] => fieldname2 asc
    [_Trailing_Data] =>
)
Kent Fredric