I would like to write an (HTML) parser based on a state machine, but I have doubts about how to actually read and use the input. I decided to load the whole input into one string, work with it as with an array, and keep an index as the current parsing position.

There would be no problem with a single-byte encoding, but in a multi-byte encoding each index refers to a single byte of a character rather than to a whole character.

Example:

$mb_string = 'žščř'; //4 multi-byte characters in UTF-8

for($i=0; $i < 4; $i++)
{
   echo $mb_string[$i], PHP_EOL;
}

Outputs:

Ĺ
ž
Ĺ
Ą

This means I cannot iterate through the string in a loop to check single characters, because I never know whether I am in the middle of a character or not.

So the questions are:

  • How do I read a single character from a string in a multi-byte-safe and performance-friendly way?
  • Is it a good idea to work with the string as if it were an array in this case?
  • How would you read the input?
+1  A: 

http://php.net/mb_string is the thing you're looking for

  • just mb_substr characters one by one
  • not until PHP6
  • what input exactly? The usual way in general
Col. Shrapnel
Note that the comments section for `mb_split` there includes many examples of how to break a multibyte string up into an array of characters - for example, http://us2.php.net/manual/en/function.mb-split.php#80046
Amber
@Dav I don't think he really needs an array.
Col. Shrapnel
@Col. Shrapnel By input I mean the HTML code to parse. Maybe there is a completely different way to use the string with a state machine that I am missing :-) ... but mb_substr looks fine (if I know the string encoding, which is not so obvious).
Petr Peller
@Dav Thanks, I was thinking about converting the string into an array of characters, but I don't think it is one of the cleanest solutions. I would feel dirty :-)
Petr Peller
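The array-of-characters idea discussed in the comments above can be sketched without the mb_* functions, using preg_split with the `u` (UTF-8) pattern modifier. This is only a minimal sketch and assumes the input is valid UTF-8; for invalid input preg_split returns false instead of an array.

```php
<?php
// Assumption: $mb_string holds valid UTF-8.
$mb_string = 'žščř';

// An empty pattern with the "u" modifier matches between every UTF-8
// character; PREG_SPLIT_NO_EMPTY drops the empty leading/trailing pieces.
$chars = preg_split('//u', $mb_string, -1, PREG_SPLIT_NO_EMPTY);

if ($chars === false) {
    // The input was not valid UTF-8.
    echo 'Invalid UTF-8 input', PHP_EOL;
} else {
    foreach ($chars as $i => $char) {
        echo $i, ': ', $char, PHP_EOL;
    }
}
```

The resulting array can then be indexed by character position, at the cost of holding a second copy of the input in memory.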
A: 
mb_internal_encoding("UTF-8");

$mb_string = 'žščř';

$l=mb_strlen($mb_string);

for($i=0;$i<$l;$i++){
    print(mb_substr($mb_string,$i,1)."<br/>");
}
zaf
A: 

Without using the mb_* functions, you can still work with multi-byte encoded strings by using the standard substring functions and reading as many bytes at a time as the encoding uses for each character.

For example, take a UTF-8 encoded string in which every character happens to occupy 2 bytes, and suppose you need the first character from it:

$string = 'žščř'; //4 multi-byte characters in UTF-8

You have to read both $string[0] and $string[1], so the first character is the substring covering byte indexes 0 and 1, i.e. substr($string, 0, 2).

Note that $string[0] or $string[N] references the first (or Nth) byte of the multi-byte string, not a character.

regards,

andreas
Wouldn't it be quite hard to know how many bytes I have to read? This is a trivial example, but in general I don't know what characters are in the input (UTF-8 characters can be 1-4 bytes long).
Petr Peller
Yes, you have to determine how many bytes are used, but this answer might still give you some information on using the non-mb_* functions to manipulate multi-byte strings. Hope you find it useful.
andreas
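For the question of how many bytes to read: in UTF-8 the first byte of each character encodes the length of the sequence, so a byte-indexed position can be advanced one whole character at a time. The following is only a sketch of that idea; `utf8_char_length` is a hypothetical helper name, and the code assumes mostly well-formed UTF-8 (it does not validate continuation bytes, and it treats an unexpected byte as a one-byte character).

```php
<?php
// Hypothetical helper: number of bytes in the UTF-8 character that
// starts at byte offset $pos of $s, derived from the lead byte.
function utf8_char_length($s, $pos)
{
    $byte = ord($s[$pos]);
    if ($byte < 0x80) {
        return 1; // 0xxxxxxx: ASCII
    }
    if (($byte & 0xE0) === 0xC0) {
        return 2; // 110xxxxx
    }
    if (($byte & 0xF0) === 0xE0) {
        return 3; // 1110xxxx
    }
    if (($byte & 0xF8) === 0xF0) {
        return 4; // 11110xxx
    }
    // Continuation or invalid byte (assumption: step over it as one byte).
    return 1;
}

$input = 'až€';          // characters of 1, 2 and 3 bytes
$length = strlen($input); // length in bytes, not characters
$pos = 0;                 // current parsing position, in bytes
$chars = array();

while ($pos < $length) {
    $n = utf8_char_length($input, $pos);
    $chars[] = substr($input, $pos, $n); // one whole character
    $pos += $n;
}

foreach ($chars as $char) {
    echo $char, PHP_EOL;
}
```

Because the position stays a plain byte offset, each step costs constant time, whereas calling mb_substr($string, $i, 1) in a loop may have to count characters from the start of the string on every call.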