views: 178
answers: 4

This is a noob question from someone who has never written a parser/lexer before.

I'm writing a tokenizer/parser for CSS in PHP (please don't respond with 'OMG, why in PHP?'). The syntax is written down neatly by the W3C here (CSS2.1) and here (CSS3, draft).

It's a list of 21 possible tokens, all but two of which cannot be represented as static strings.

My current approach is to loop over an array containing the 21 patterns again and again, do an if (preg_match()) and reduce the source string match by match. In principle this works really well. However, for a 1,000-line CSS string it takes somewhere between 2 and 8 seconds, which is too much for my project.

Now I'm banging my head over how other parsers tokenize and parse CSS in fractions of a second. OK, C is always faster than PHP, but nonetheless: are there any obvious D'oh!s I have fallen into?

I made some optimizations, like checking for '@', '#' or '"' as the first char of the remaining string and then applying only the relevant regexps, but this hasn't brought any great performance boost.

My code (snippet) so far:

$TOKENS = array(
  'IDENT' => '...regexp...',
  'ATKEYWORD' => '@...regexp...',
  'String' => '"...regexp..."|\'...regexp...\'',
  //...
);

$string = '...CSS source string...';
$stream = array();

// we reduce $string token by token
while ($string != '') {
    $string = ltrim($string, " \t\r\n\f"); // unconsumed whitespace at the
        // start is insignificant but doing a trim reduces exec time by 25%
    $matches = array();
    // loop through all possible tokens
    foreach ($TOKENS as $t => $p) {
        // The '&' is used as delimiter, because it isn't used anywhere in
        // the token regexps
        if (preg_match('&^'.$p.'&Su', $string, $matches)) {
            $stream[] = array($t, $matches[0]);
            $string = substr($string, strlen($matches[0]));
            // Yay! We found one that matches!
            continue 2;
        }
    }
    // if we come here, we have a syntax error and handle it somehow
}

// result: an array $stream consisting of arrays with
// 0 => type of token
// 1 => token content
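
For illustration, the first-character dispatch mentioned above could sit at the top of the loop body like this (a sketch only; the dispatch table and token names are assumptions, not the real 21-token grammar):

$DISPATCH = array(
    // hypothetical: first char of the remaining string => the only
    // token types that can start with it
    '@' => array('ATKEYWORD'),
    '#' => array('HASH'),
    '"' => array('String'),
    "'" => array('String'),
);

$first = $string[0];
$candidates = isset($DISPATCH[$first])
    ? $DISPATCH[$first]
    : array_keys($TOKENS); // no shortcut known: try every token type

foreach ($candidates as $t) {
    if (preg_match('&^'.$TOKENS[$t].'&Su', $string, $matches)) {
        $stream[] = array($t, $matches[0]);
        $string = substr($string, strlen($matches[0]));
        continue 2; // back to the enclosing while loop
    }
}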
A: 

The first thing I would do is get rid of the preg_match(). Basic string functions such as strpos() are much faster, but I don't think you even need that. It looks like you are using preg_match() to look for a specific token at the front of the string, and then taking that many characters off the front as a substring. You could accomplish this with a simple substr() instead, like this:

foreach ($TOKENS as $t => $p)
{
    $len = strlen($p);  // this could be pre-stored in $TOKENS
    $front = substr($string, 0, $len);
    if ($front == $p) {
        $stream[] = array($t, $front); // push the matched token, not the whole string
        $string = substr($string, $len);
        // Yay! We found one that matches!
        continue 2;
    }
}

You could further optimize that by pre-calculating the lengths of all your tokens and storing them in the $TOKENS array, so that you don't have to call strlen() all the time. If you sorted $TOKENS into groups by length, you could cut the number of substr() calls further as well: take a substr() of the string being analyzed just once per token length, and run through all the tokens of that length before moving on to the next group.
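
A sketch of that idea, under this answer's assumption that every token is a literal string (which, as the comment below points out, does not hold for CSS):

// pre-compute once: group literal tokens by length, so each length
// needs only one substr() per loop iteration
$TOKENS_BY_LEN = array();
foreach ($TOKENS as $t => $p) {
    $TOKENS_BY_LEN[strlen($p)][$p] = $t;
}
krsort($TOKENS_BY_LEN); // try the longest tokens first

// inside the while loop:
foreach ($TOKENS_BY_LEN as $len => $group) {
    $front = substr($string, 0, $len); // one substr() per length group
    if (isset($group[$front])) {
        $stream[] = array($group[$front], $front);
        $string = substr($string, $len);
        continue 2;
    }
}

As a bonus, the hash lookup replaces one string comparison per token with a single array access per length group.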

zombat
The problem is that I don't know the token lengths in advance. An @-token, for example, may be '@charset', '@namespace' or '@import', but also something arbitrary like '@-moz-document'. It is defined as '@' followed by one or more [a-zA-Z0-9_-] characters, *or* escape sequences (like `\10FFFF`), *or* any non-ASCII Unicode character. I could abandon `preg_match` and just process the chars following an '@', but then I'd have to test char by char whether it is non-ASCII, allowed ASCII, or ASCII as part of an escape sequence, and I thought that this is what regex engines are optimized for.
Boldewyn
+2  A: 

Use a lexer generator.
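
What makes generated lexers fast: they compile all token patterns into a single DFA and do one table lookup per input character, so the running time is linear in the input length no matter how many token types there are. A hand-written toy in the same spirit (not actual generator output; it only recognizes identifier runs):

// one isset() per character instead of one regexp attempt per token type
$is_ident = array_flip(str_split(
    'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_-'));

$i = 0;
$n = strlen($string);
while ($i < $n) {
    if (isset($is_ident[$string[$i]])) {
        $start = $i;
        while ($i < $n && isset($is_ident[$string[$i]])) $i++;
        $stream[] = array('IDENT', substr($string, $start, $i - $start));
    } else {
        $i++; // a real DFA would transition to other token states here
    }
}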

erikkallen
Thanks for the pointer. I'll look at it (and especially at the generated code).
Boldewyn
Do what erik suggests. Until you understand what a lexer generator offers you and how it works, you won't understand why it can lex input streams so spectacularly fast.
Ira Baxter
A: 

The (probably) faster (but less memory-friendly) approach would be to tokenize the whole stream at once, using one big regexp with an alternative for each token, like:

 preg_match_all('/
       (...string...)
       |
       (@ident)
       |
       (#ident)
       ...etc
   /x', $string, $tokens);

 foreach ($tokens[0] as $token) ...parse
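
If the big regexp uses named subpatterns, each match carries its token type with it, so no second classification pass is needed (which also addresses the follow-up comment below). A sketch with simplified placeholder patterns, not the real CSS token definitions:

$pattern = '/(?P<String>"[^"]*"|\'[^\']*\')
            |(?P<ATKEYWORD>@[\w-]+)
            |(?P<HASH>\#[\w-]+)/x';

preg_match_all($pattern, $string, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // exactly one named group is non-empty per match: the token type
    foreach ($m as $name => $value) {
        if (!is_int($name) && $value !== '') {
            $stream[] = array($name, $value);
            break;
        }
    }
}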
stereofrog
In principle, this would work (as long as the memory constraints don't blow up). However, afterwards I have to loop through every match and find out what type of token it is. It may work (in many cases it's sufficient to look at the first character), but I doubt it would be much faster.
Boldewyn
A: 

Don't use regexps; scan character by character.

$tokens = array();
$string = "...code...";
$length = strlen($string);
$i = 0;

// true for the identifier characters handled below
function is_ident_char($char) {
  return ($char >= 'A' && $char <= 'Z')
      || ($char >= 'a' && $char <= 'z')
      || $char == '_' || $char == '-';
}

while ($i < $length) {
  $char = $string[$i];
  if (is_ident_char($char)) {
    // identifier: consume the whole run of identifier characters
    $buf = '';
    while ($i < $length && is_ident_char($string[$i])) {
      $buf .= $string[$i];
      $i++;
    }
    $tokens[] = array('IDENT', $buf);
  } else if (......) {
    // ......
  } else {
    $i++; // always advance, or an unrecognized character loops forever
  }
}

However, that makes the code hard to maintain, so a parser generator is better.

SHiNKiROU