How to iterate a UTF-8 string character by character using indexing?

When you index a UTF-8 string with the bracket operator, e.g. $str[0], you get individual bytes, so a multi-byte character is spread across 2 or more elements.

For example:

$str = "Kąt";
$str[0] = "K";
$str[1] = "�";
$str[2] = "�";
$str[3] = "t";

but I would like to have:

$str[0] = "K";
$str[1] = "ą";
$str[2] = "t";

This is possible with mb_substr, but it is extremely slow, e.g.:

mb_substr($str, 0, 1) = "K"
mb_substr($str, 1, 1) = "ą"
mb_substr($str, 2, 1) = "t"

Is there another way to iterate over the string character by character without using mb_substr?

+5  A: 

Use preg_split. With the "u" modifier, the pattern and the subject string are treated as UTF-8.

$chrArray = preg_split('//u',$str, -1, PREG_SPLIT_NO_EMPTY);
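
As a minimal usage sketch (the sample string and the fallback to an empty array are illustrative assumptions, not part of the original answer), the resulting array can then be indexed by character rather than by byte:

```php
<?php
// Illustrative input; assumed to be valid UTF-8.
$str = "Kąt";

// Split at every character boundary. The "u" modifier makes PCRE treat
// the subject as UTF-8, and PREG_SPLIT_NO_EMPTY drops the empty pieces
// that would otherwise appear at the start and end of the result.
$chrArray = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

// preg_split returns false on failure (for example, invalid UTF-8),
// so fall back to an empty array before iterating.
if ($chrArray === false) {
    $chrArray = array();
}

foreach ($chrArray as $i => $chr) {
    echo $i, ": ", $chr, "\n";
}
```

Here $chrArray[1] holds the whole character "ą" instead of a single byte of it.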
vartec
This is very elegant, but I'm having a hard time imagining this *faster* than `mb_substr()`.
Pekka
@Pekka It probably is. Using `mb_substr` is quadratic on the length of the string; this is linear even though there's the overhead of building an array. Of course, it takes a lot more memory than your method.
Artefacto
@Artefacto ah, makes sense.
Pekka
I've just tested it. For a string of 100 characters, preg_split is 50% *faster*.
vartec
Even more, I have tested on more than 1000 'long' documents and it is 40 times faster :-) (see my answer).
czuk
This solution is OK and I have applied it.
czuk
+2  A: 

You could parse each byte of the string and determine whether it is a single (ASCII) character or the start of a multi-byte character:

The UTF-8 encoding is variable-width, with each character represented by 1 to 4 bytes. Each byte has 0–4 leading consecutive '1' bits followed by a '0' bit to indicate its type. 2 or more '1' bits indicates the first byte in a sequence of that many bytes.

You would walk through the string and, instead of increasing the position by 1, read the current character in full and then increase the position by the number of bytes that character occupied.

The Wikipedia article has the interpretation table for each character:

   0-127 Single-byte encoding (compatible with US-ASCII)
 128-191 Second, third, or fourth byte of a multi-byte sequence
 192-193 Overlong encoding: start of 2-byte sequence, 
         but would encode a code point ≤ 127
  ........
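
The walk described above could be sketched roughly as follows. The function names utf8_seq_length and split_bytes are hypothetical, and the sketch assumes the input is already valid UTF-8: it only inspects the lead byte and does not validate continuation bytes or reject overlong encodings.

```php
<?php
// Hypothetical helper: given a lead byte (0-255), return how many bytes
// the UTF-8 sequence starting there occupies. Bytes that cannot start a
// sequence yield 1 so that the caller always advances.
function utf8_seq_length($byte) {
    if ($byte < 0x80) return 1;            // 0xxxxxxx: single-byte (ASCII)
    if (($byte & 0xE0) === 0xC0) return 2; // 110xxxxx: start of 2-byte sequence
    if (($byte & 0xF0) === 0xE0) return 3; // 1110xxxx: start of 3-byte sequence
    if (($byte & 0xF8) === 0xF0) return 4; // 11110xxx: start of 4-byte sequence
    return 1;                              // continuation or invalid byte
}

// Hypothetical helper: split $str into characters by walking its bytes.
function split_bytes($str) {
    $chars = array();
    $length = strlen($str); // length in bytes, not characters
    for ($pos = 0; $pos < $length; $pos += $n) {
        $n = utf8_seq_length(ord($str[$pos]));
        $chars[] = substr($str, $pos, $n);
    }
    return $chars;
}

print_r(split_bytes("Kąt"));
```

Because every byte is visited at most once, the walk is linear in the length of the string.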
Pekka
+2  A: 

In answer to the comments posted by @Pekka and @Col. Shrapnel, I have compared preg_split with mb_substr.

[Image: timing comparison of preg_split and mb_substr]

The image shows that preg_split took 1.2 s, while mb_substr took almost 25 s.

Here is the code of the functions:

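function split_preg($str){
    // Without PREG_SPLIT_NO_EMPTY the result also contains an empty
    // string at the start and at the end.
    return preg_split('//u', $str, -1);
}

function split_mb($str){
    $length = mb_strlen($str);
    $chars = array();
    // One mb_substr call per character.
    for ($i=0; $i<$length; $i++){
        $chars[] = mb_substr($str, $i, 1);
    }
    // Trailing empty element, as in the preg_split result.
    $chars[] = "";
    return $chars;
}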
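
A comparison along these lines could be timed with a small harness such as the one below. This is only a sketch: the helper time_call, the sample string, and the repetition count are assumptions and not the setup used for the measurement above, so the numbers it prints will differ.

```php
<?php
// The two functions under test, repeated so the sketch runs on its own.
function split_preg($str) {
    return preg_split('//u', $str, -1);
}

function split_mb($str) {
    $length = mb_strlen($str);
    $chars = array();
    for ($i = 0; $i < $length; $i++) {
        $chars[] = mb_substr($str, $i, 1);
    }
    $chars[] = "";
    return $chars;
}

// Hypothetical helper: call $fn on $str $times times and return the
// elapsed wall-clock time in seconds.
function time_call($fn, $str, $times) {
    $start = microtime(true);
    for ($i = 0; $i < $times; $i++) {
        $fn($str);
    }
    return microtime(true) - $start;
}

// Assumed test input: 1000 characters, some of them multi-byte.
$sample = str_repeat("Kąt ", 250);

printf("preg_split: %.4f s\n", time_call('split_preg', $sample, 100));
printf("mb_substr:  %.4f s\n", time_call('split_mb', $sample, 100));
```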
czuk