views: 178
answers: 4

This is a noob question from someone who has never written a parser/lexer before.

I'm writing a tokenizer/parser for CSS in PHP (please don't respond with 'OMG, why in PHP?'). The syntax is written down neatly by the W3C here (CSS2.1) and here (CSS3, draft).

It's a list of 21 possible tokens, all but two of which cannot be represented as static strings.

My current approach is to loop over an array containing the 21 patterns again and again, do an if (preg_match()) and reduce the source string match by match. In principle this works really well. However, for a 1,000-line CSS string it takes somewhere between 2 and 8 seconds, which is too much for my project.

Now I'm banging my head over how other parsers tokenize and parse CSS in fractions of a second. OK, C is always faster than PHP, but nonetheless: are there any obvious D'oh!s I have fallen into?

I made some optimizations, like checking for '@', '#' or '"' as the first char of the remaining string and then applying only the relevant regexps, but this hasn't brought any great performance boost.

My code (snippet) so far:

$TOKENS = array(
  'IDENT' => '...regexp...',
  'ATKEYWORD' => '@...regexp...',
  'String' => '"...regexp..."|\'...regexp...\'',
  //...
);

$string = '...CSS source string...';
$stream = array();

// we reduce $string token by token
while ($string != '') {
    $string = ltrim($string, " \t\r\n\f"); // unconsumed whitespace at the
        // start is insignificant but doing a trim reduces exec time by 25%
    $matches = array();
    // loop through all possible tokens
    foreach ($TOKENS as $t => $p) {
        // The '&' is used as delimiter, because it isn't used anywhere in
        // the token regexps
        if (preg_match('&^'.$p.'&Su', $string, $matches)) {
            $stream[] = array($t, $matches[0]);
            $string = substr($string, strlen($matches[0]));
            // Yay! We found one that matches!
            continue 2;
        }
    }
    // if we come here, we have a syntax error and handle it somehow
}

// result: an array $stream consisting of arrays with
// 0 => type of token
// 1 => token content
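
For illustration, the first-character dispatch mentioned above could sit at the top of the loop body like this (a sketch only; the dispatch table and token names are assumptions, not the real 21-token grammar):

$DISPATCH = array(
    // hypothetical: first char of the remaining string => the only
    // token types that can start with it
    '@' => array('ATKEYWORD'),
    '#' => array('HASH'),
    '"' => array('String'),
    "'" => array('String'),
);

$first = $string[0];
$candidates = isset($DISPATCH[$first])
    ? $DISPATCH[$first]
    : array_keys($TOKENS); // no shortcut known: try every token type

foreach ($candidates as $t) {
    if (preg_match('&^'.$TOKENS[$t].'&Su', $string, $matches)) {
        $stream[] = array($t, $matches[0]);
        $string = substr($string, strlen($matches[0]));
        continue 2; // back to the enclosing while loop
    }
}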
A: 

The first thing I would do is get rid of the preg_match(). Basic string functions such as strpos() are much faster, but I don't think you even need that. It looks like you are using preg_match() to look for a specific token at the front of the string, and then taking that many characters off the front as a substring. You could accomplish this with a simple substr() instead, like this:

foreach ($TOKENS as $t => $p)
{
    $len = strlen($p);  // this could be pre-stored in $TOKENS
    $front = substr($string, 0, $len);
    if ($front == $p) {
        $stream[] = array($t, $front); // push the matched token, not the whole string
        $string = substr($string, $len);
        // Yay! We found one that matches!
        continue 2;
    }
}

You could further optimize that by pre-calculating the lengths of all your tokens and storing them in the $TOKENS array, so that you don't have to call strlen() all the time. If you sorted $TOKENS into groups by length, you could cut the number of substr() calls further as well: take a substr() of the string being analyzed just once per token length, and run through all the tokens of that length before moving on to the next group.
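
A sketch of that idea, under this answer's assumption that every token is a literal string (which, as the comment below points out, does not hold for CSS):

// pre-compute once: group literal tokens by length, so each length
// needs only one substr() per loop iteration
$TOKENS_BY_LEN = array();
foreach ($TOKENS as $t => $p) {
    $TOKENS_BY_LEN[strlen($p)][$p] = $t;
}
krsort($TOKENS_BY_LEN); // try the longest tokens first

// inside the while loop:
foreach ($TOKENS_BY_LEN as $len => $group) {
    $front = substr($string, 0, $len); // one substr() per length group
    if (isset($group[$front])) {
        $stream[] = array($group[$front], $front);
        $string = substr($string, $len);
        continue 2;
    }
}

As a bonus, the hash lookup replaces one string comparison per token with a single array access per length group.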

zombat
The problem is that I don't know the token lengths in advance. An @-token, for example, may be '@charset', '@namespace' or '@import', but also something arbitrary like '@-moz-document'. It is defined as '@' followed by one or more [a-zA-Z0-9_-] characters, *or* escape sequences (like `\10FFFF`), *or* any non-ASCII Unicode character. I could abandon `preg_match` and just process the chars following an '@', but then I'd have to test char by char whether it is non-ASCII, allowed ASCII, or ASCII as part of an escape sequence, and I thought that this is what regex engines are optimized for.
Boldewyn
+2  A: 

Use a lexer generator.
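
What makes generated lexers fast: they compile all token patterns into a single DFA and do one table lookup per input character, so the running time is linear in the input length no matter how many token types there are. A hand-written toy in the same spirit (not actual generator output; it only recognizes identifier runs):

// one isset() per character instead of one regexp attempt per token type
$is_ident = array_flip(str_split(
    'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_-'));

$i = 0;
$n = strlen($string);
while ($i < $n) {
    if (isset($is_ident[$string[$i]])) {
        $start = $i;
        while ($i < $n && isset($is_ident[$string[$i]])) $i++;
        $stream[] = array('IDENT', substr($string, $start, $i - $start));
    } else {
        $i++; // a real DFA would transition to other token states here
    }
}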

erikkallen
Thanks for the pointer. I'll look at it (and especially at the generated code).
Boldewyn
Do what erik suggests. Until you understand what a lexer generator offers you and how it works, you won't understand why it can lex input streams so spectacularly fast.
Ira Baxter
A: 

The (probably) faster (but less memory-friendly) approach would be to tokenize the whole stream at once, using one big regexp with an alternative for each token, like:

 preg_match_all('/
       (...string...)
       |
       (@ident)
       |
       (#ident)
       ...etc
   /x', $string, $tokens);

 foreach ($tokens[0] as $token) ...parse
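
If the big regexp uses named subpatterns, each match carries its token type with it, so no second classification pass is needed (which also addresses the follow-up comment below). A sketch with simplified placeholder patterns, not the real CSS token definitions:

$pattern = '/(?P<String>"[^"]*"|\'[^\']*\')
            |(?P<ATKEYWORD>@[\w-]+)
            |(?P<HASH>\#[\w-]+)/x';

preg_match_all($pattern, $string, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // exactly one named group is non-empty per match: the token type
    foreach ($m as $name => $value) {
        if (!is_int($name) && $value !== '') {
            $stream[] = array($name, $value);
            break;
        }
    }
}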
stereofrog
In principle, this would work (as long as the memory constraints don't blow up). However, afterwards I have to loop through every match and find out what type of token it is. It may work (in many cases it's sufficient to look at the first character), but I doubt it would be much faster.
Boldewyn
A: 

Don't use regexps; scan character by character.

$tokens = array();
$string = "...code...";
$length = strlen($string);
$i = 0;

// true for the identifier characters handled below
function is_ident_char($char) {
  return ($char >= 'A' && $char <= 'Z')
      || ($char >= 'a' && $char <= 'z')
      || $char == '_' || $char == '-';
}

while ($i < $length) {
  $char = $string[$i];
  if (is_ident_char($char)) {
    // identifier: consume the whole run of identifier characters
    $buf = '';
    while ($i < $length && is_ident_char($string[$i])) {
      $buf .= $string[$i];
      $i++;
    }
    $tokens[] = array('IDENT', $buf);
  } else if (......) {
    // ......
  } else {
    $i++; // always advance, or an unrecognized character loops forever
  }
}

However, that makes the code hard to maintain, so a parser generator is better.

SHiNKiROU