I sat down for a while and wrote a fully fledge FSM parser, just for the sake of interest.
It has a few features that you're not likely to see with regex ( at least under PHP, I could do it with recursive regex in Perl, but not PHP, it doesn't have that feature yet ).
- Smart and stack-based bracket parsing
- AnyBracket Support
- Modular
- Extensible.
- When the syntax is wrong, it can tell you where somewhat.
Granted there's a butload of code here, and a lot of it is a bit bizzare and convoluted to new coders, but in terms of what it is, its pretty awesome stuff.
Its not a finished product, just someting I threw together, but it works and doesn't have any bugs that I can find.
I've got 'die' in lots of places where It would normally be better to use Exceptions and whatnot, so a cleanup and refactor would be preferable prior to just rolling it out.
It has a reasonable amount of commenting, but I feel if I were to comment further the nitty-gritty that is finite-state-machining would be harder to understand.
# Pretty Colour Debug of the tokeniser in action.
# Uncomment to use.
function debug( $title, $stream, $msg, $remaining ){
# print chr(27) ."[31m$title" . chr(27) ."[0m\n";
# print chr(27) ."[33min:$stream" . chr(27) ."[0m\n";
# print chr(27) ."[32m$msg" . chr(27) ."[0m\n";
# print chr(27) ."[34mstream:$remaining" . chr(27) ."[0m\n\n";
}
# Simple utility to store a captured part of the stream in one place
# and the remainder somewhere else
# Wraps most the regexy stuff
# Insprired by some Perl Regex Parser I found.
function get_token( $regex, $input ){
$out = array(
'success' => false,
'match' => '',
'rest' => ''
);
if( !preg_match( '/^' . $regex . '/' , $input, $matches ) ){
die("Could not match $regex at start of $input ");
#return $out; # error condition, not matched.
}
$out['match'] = $matches[1];
$out['rest'] = substr( $input, strlen( $out['match'] ) );
$out['success'] = true;
debug( 'Scan For Token: '. $regex , $input, "matched: " . $out['match'] , $out['rest'] );
return $out;
}
function skip_space( $input ){
return get_token('(\s*)', $input );
}
# Given $input and $opener, find
# the data stream that occurs until the respecive closer.
# All nested bracket sets must be well balanced.
# No 'escape code' implementation has been done (yet)
# Match will contain the contents,
# Rest will contain unprocessed part of the string
# []{}() and bracket types are currently supported.
function close_bracket( $input , $opener ){
$out = array(
'success' => false,
'match' => '',
'rest' => ''
);
$map = array( '(' => ')', '[' => ']', '{' => '}', chr(60) => '>' );
$nests = array( $map[$opener] );
while( strlen($input) > 0 ){
$d = get_token( '([^()\[\]{}' . chr(60). '>]*?[()\[\]{}' . chr(60) . '>])', $input );
$input = $d['rest'];
if( !$d['success'] ){
debug( 'Scan For ) Bailing ' , $input, "depth: $nests, matched: " . $out['match'] , $out['rest'] );
$out['match'] .= $d['match'];
return $out; # error condition, not matched. brackets are imbalanced.
}
# Work out which of the 4 bracket types we got, and
# Which orientation it is, and then decide if were going up the tree or down it
end($nests);
$tail = substr( $d['match'], -1, 1 );
if( $tail == current($nests) ){
array_pop( $nests );
} elseif ( array_key_exists( $tail, $map ) ){
array_push( $nests, $map[$tail] );
} else {
die ("Error. Bad bracket Matching, unclosed/unbalanced/unmatching bracket sequence: " . $out['match'] . $d['match'] );
}
$out['match'] .= $d['match'] ;
$out['rest' ] = $d['rest'];
debug( 'Scan For ) running' , $input, "depth: $nests, matched: " . $out['match'] , $out['rest'] );
if ( count($nests) == 0 ){
# Chomp off the tail bracket to just get the body
$out['match'] = substr( $out['match'] , 0 , -1 );
$out['success'] = true;
debug( 'Scan For ) returning ' , $input, "matched: " . $out['match'] , $out['rest'] );
return $out;
}
else {
}
}
die('Scan for closing ) exhausted buffer while searching. Brackets Missmatched. Fix this: \'' . $out['match'] . '\'');
}
# Given $function_name and $input, expects the form fnname(data)
# 'data' can be any well balanced bracket sequence
# also, brackets used for functions in the stream can be any of your choice,
# as long as you're consistent. fnname[foo] will work.
function parse_function_body( $input, $function_name ){
$out = array (
'success' => false,
'match' => '',
'rest' => '',
);
debug( 'Parsing ' . $function_name . "()", $input, "" , "" );
$d = get_token( "(" . $function_name . '[({\[' . chr(60) . '])' , $input );
if ( !$d['success'] ){
die("Doom while parsing for function $function_name. Not Where its expected.");
}
$e = close_bracket( $d['rest'] , substr($d['match'],-1,1) );
if ( !$e['success'] ){
die("Found Imbalanced Brackets while parsing for $function_name, last snapshot was '" . $e['match'] . "'");
return $out; # inbalanced brackets for function
}
$out['success'] = true;
$out['match'] = $e['match'];
$out['rest'] = $e['rest'];
debug( 'Finished Parsing ' . $function_name . "()", $input, 'body:'. $out['match'] , $out['rest'] );
return $out;
}
function parse_query( $input ){
$eat = skip_space( $input );
$get = parse_function_body( $eat['rest'] , 'get' );
if ( !$get['success'] ){
die("Get Token Malformed/Missing, instead found '" . $eat['rest'] . "'");
}
$eat = skip_space( $get['rest'] );
$where = parse_function_body( $eat['rest'], 'where' );
if ( !$where['success'] ){
die("Where Token Malformed/Missing, instead found '" . $eat['rest'] . "'");
}
$eat = skip_space( $where['rest'] );
$sort = parse_function_body( $eat['rest'], 'sort' );
if( !$sort['success'] ){
die("Sort Token Malformed/Missing, instead found '" . $eat['rest'] . "'");
}
return array(
'get' => $get['match'],
'where' => $where['match'],
'sort' => $sort['match'],
'_Trailing_Data' => $sort['rest'],
);
}
$structure = parse_query("get[max(fieldname1),min(fieldname2),fieldname3]where(something=something) sort(fieldname2 asc)");
print_r($structure);
$structure = parse_query("get(max(fieldname1),min(fieldname2),fieldname3)where(something=something) sort(fieldname2 asc)");
print_r($structure);
$structure = parse_query("get{max(fieldname1),min(fieldname2),fieldname3}where(something=something) sort(fieldname2 asc)");
print_r($structure);
$structure = parse_query("get" . chr(60) . "max(fieldname1),min(fieldname2),fieldname3" . chr(60). "where(something=something) sort(fieldname2 asc)");
print_r($structure);
All of the above print_r($structure) lines should produce this:
Array
(
[get] => max(fieldname1),min(fieldname2),fieldname3
[where] => something=something
[sort] => fieldname2 asc
[_Trailing_Data] =>
)