ansaurus

Question

Answer 1

+5 A:

Use preg_split; with the "u" modifier it handles UTF-8 strings correctly.

$chrArray = preg_split('//u',$str, -1, PREG_SPLIT_NO_EMPTY);
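A minimal sketch of iterating over the resulting array. The sample string and the loop are assumptions added for illustration; the string is written as byte escapes so the example does not depend on the file's encoding.

```php
<?php
// Hypothetical sample: "a", "é" (2 bytes), "€" (3 bytes), and a 4-byte emoji.
$str = "a\xC3\xA9\xE2\x82\xAC\xF0\x9F\x98\x80";

// PREG_SPLIT_NO_EMPTY drops the empty matches at the start and end of the string.
$chrArray = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

foreach ($chrArray as $i => $chr) {
    // strlen() counts bytes, so it shows how wide each character is in UTF-8.
    printf("%d: %s (%d bytes)\n", $i, $chr, strlen($chr));
}
```

If $str is not valid UTF-8, preg_split with the "u" modifier returns false, so real code would probably want to check the return value before looping.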

vartec 2010-09-08 09:30:52

This is very elegant, but I'm having a hard time imagining this *faster* than `mb_substr()`.

Pekka 2010-09-08 09:40:22

@Pekka It probably is. Using `mb_substr` is quadratic in the length of the string; this is linear, even with the overhead of building an array. Of course, it takes a lot more memory than your method.

Artefacto 2010-09-08 09:53:55

@Artefacto ah, makes sense.

Pekka 2010-09-08 09:54:41

I've just tested it. For a string of 100 characters, preg_split is 50% *faster*.

vartec 2010-09-08 09:58:00

Even more, I have tested on more than 1000 'long' documents and it is 40 times faster :-) (see my answer).

czuk 2010-09-08 10:06:31

This solution is OK and I have applied it.

czuk 2010-09-08 10:09:39

Answer 2

+2 A:

You could parse each byte of the string and determine whether it is a single (ASCII) character or the start of a multi-byte character:

The UTF-8 encoding is variable-width, with each character represented by 1 to 4 bytes. Each byte has 0–4 leading consecutive '1' bits followed by a '0' bit to indicate its type. 2 or more '1' bits indicates the first byte in a sequence of that many bytes.

You would walk through the string and, instead of increasing the position by 1, read the current character in full and then increase the position by that character's length in bytes.

The Wikipedia article has the interpretation table for each character:

   0-127 Single-byte encoding (compatible with US-ASCII)
 128-191 Second, third, or fourth byte of a multi-byte sequence
 192-193 Overlong encoding: start of 2-byte sequence, 
         but would encode a code point ≤ 127
  ........
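
The walk described above might be sketched as follows. This is only an illustration under stated assumptions: the function name utf8_chars is hypothetical, the input is assumed to be valid UTF-8, and no checks for overlong or truncated sequences are made.

```php
<?php
// Hypothetical helper: splits a UTF-8 string into characters by reading
// the lead byte of each sequence. Assumes valid UTF-8; nothing is validated.
function utf8_chars($str)
{
    $chars = [];
    $len = strlen($str);   // length in bytes, not characters
    $pos = 0;
    while ($pos < $len) {
        $byte = ord($str[$pos]);
        if ($byte < 0x80) {
            $width = 1;    // 0xxxxxxx: single-byte (ASCII)
        } elseif ($byte >= 0xF0) {
            $width = 4;    // 11110xxx: first byte of a 4-byte sequence
        } elseif ($byte >= 0xE0) {
            $width = 3;    // 1110xxxx: first byte of a 3-byte sequence
        } elseif ($byte >= 0xC0) {
            $width = 2;    // 110xxxxx: first byte of a 2-byte sequence
        } else {
            $width = 1;    // 10xxxxxx: stray continuation byte, stepped over
        }
        // Read the whole character, then advance by its width.
        $chars[] = substr($str, $pos, $width);
        $pos += $width;
    }
    return $chars;
}
```

Each byte is visited once, so the walk is linear in the length of the string.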

Pekka 2010-09-08 09:31:35

Answer 3

+2 A:

In answer to comments posted by @Pekka and @Col. Shrapnel, I have compared preg_split with mb_substr.

The image shows that preg_split took 1.2 s, while mb_substr took almost 25 s.

Here is the code of the functions:

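// Split with PCRE; the "u" modifier makes the pattern UTF-8 aware.
function split_preg($str){
    // PREG_SPLIT_NO_EMPTY drops the empty strings matched at both ends.
    return preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
}

// Split by extracting one character at a time with mb_substr.
function split_mb($str){
    $length = mb_strlen($str, 'UTF-8');
    $chars = array();
    for ($i=0; $i<$length; $i++){
        $chars[] = mb_substr($str, $i, 1, 'UTF-8');
    }
    return $chars;
}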
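
A rough way to reproduce this kind of comparison is sketched below. The helper time_call, the test string, and the run count are all assumptions for illustration, not the setup behind the figures above. It also assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical harness: calls $fn($arg) $runs times and returns elapsed seconds.
function time_call($fn, $arg, $runs)
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $fn($arg);
    }
    return microtime(true) - $start;
}

// Assumed test input: 2000 two-byte characters ("é" repeated).
$text = str_repeat("\xC3\xA9", 2000);

// One pass over the string with a UTF-8 aware regular expression.
$viaPreg = function ($s) {
    return preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
};

// One mb_substr call per character.
$viaMb = function ($s) {
    $chars = [];
    $length = mb_strlen($s, 'UTF-8');
    for ($i = 0; $i < $length; $i++) {
        $chars[] = mb_substr($s, $i, 1, 'UTF-8');
    }
    return $chars;
};

printf("preg_split: %.4f s\n", time_call($viaPreg, $text, 20));
printf("mb_substr:  %.4f s\n", time_call($viaMb, $text, 20));
```

Absolute timings will vary with the PHP version and the input, so the numbers are only meaningful relative to each other on the same machine.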

czuk 2010-09-08 10:04:04

How to iterate UTF-8 string in PHP?
