In PHP < 6, what is the best way to split a string into an array of Unicode characters? And what if the input is not necessarily UTF-8?

I want to know whether the set of Unicode characters in an input string is a subset of another set of Unicode characters.

Why not go straight to the mb_ family of functions, which the first couple of answers did not use?

+3  A: 

Try this:

preg_match_all('/./u', $text, $array);
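
Building on the same 'u' modifier, here is a minimal sketch of the subset check the question asks about. The function name uses_only_chars is hypothetical, and the sketch assumes both strings are meant to be UTF-8; it treats invalid UTF-8 as "not a subset" rather than guessing an encoding.

```php
<?php
// Hypothetical helper: true when every character of $input also occurs in $allowed.
function uses_only_chars($input, $allowed)
{
    // An empty pattern with /u splits between code points;
    // PREG_SPLIT_NO_EMPTY drops the empty strings at both ends.
    $inputChars   = preg_split('//u', $input, -1, PREG_SPLIT_NO_EMPTY);
    $allowedChars = preg_split('//u', $allowed, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split returns false when a subject is not valid UTF-8.
    if ($inputChars === false || $allowedChars === false) {
        return false;
    }

    // array_diff keeps the input characters that are missing from $allowedChars.
    return count(array_diff($inputChars, $allowedChars)) === 0;
}

var_dump(uses_only_chars('文字', 'abc 文字化け'));   // bool(true)
var_dump(uses_only_chars('文字x', 'abc 文字化け'));  // bool(false)
```

This compares code points only; it does not normalize combining sequences, so a precomposed "é" and "e" followed by a combining accent count as different characters.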
JasonWoof
+1 That’s clever!
Gumbo
+5  A: 

Hi,

You could use the 'u' modifier with a PCRE regex; see Pattern Modifiers (quoting):

u (PCRE8)

This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.


For instance, consider this code:

header('Content-type: text/html; charset=UTF-8');  // So the browser doesn't make our lives harder
$str = "abc 文字化け, efg";

$results = array();
preg_match_all('/./', $str, $results);
var_dump($results[0]);

You'll only get garbage, because without the 'u' modifier the dot matches one byte at a time:

array
  0 => string 'a' (length=1)
  1 => string 'b' (length=1)
  2 => string 'c' (length=1)
  3 => string ' ' (length=1)
  4 => string '�' (length=1)
  5 => string '�' (length=1)
  6 => string '�' (length=1)
  7 => string '�' (length=1)
  8 => string '�' (length=1)
  9 => string '�' (length=1)
  10 => string '�' (length=1)
  11 => string '�' (length=1)
  12 => string '�' (length=1)
  13 => string '�' (length=1)
  14 => string '�' (length=1)
  15 => string '�' (length=1)
  16 => string ',' (length=1)
  17 => string ' ' (length=1)
  18 => string 'e' (length=1)
  19 => string 'f' (length=1)
  20 => string 'g' (length=1)


But with this code:

header('Content-type: text/html; charset=UTF-8');  // So the browser doesn't make our lives harder
$str = "abc 文字化け, efg";

$results = array();
preg_match_all('/./u', $str, $results);
var_dump($results[0]);

(Notice the 'u' at the end of the regex)

You get what you want:

array
  0 => string 'a' (length=1)
  1 => string 'b' (length=1)
  2 => string 'c' (length=1)
  3 => string ' ' (length=1)
  4 => string '文' (length=3)
  5 => string '字' (length=3)
  6 => string '化' (length=3)
  7 => string 'け' (length=3)
  8 => string ',' (length=1)
  9 => string ' ' (length=1)
  10 => string 'e' (length=1)
  11 => string 'f' (length=1)
  12 => string 'g' (length=1)


Hope this helps :-)

Pascal MARTIN
+1 good detailed example! :)
Shadi Almosri
@Shadi Almosri : thanks :-)
Pascal MARTIN
+1  A: 

If for some reason the regex way isn't enough for you: I once wrote Zend_Locale_UTF8, which is abandoned but might help you if you decide to do it on your own.

In particular, have a look at the class Zend_Locale_UTF8_PHP5_String, which reads in Unicode strings and, in order to work with them, splits them up into single characters (each of which may obviously consist of multiple bytes).

EDIT: I just realized that ZF's svn-browser is down, so I copied the important methods for convenience:

/**
 * Returns the UTF-8 code sequence as an array for any given $string.
 *
 * @access protected
 * @param string|integer $string
 * @return array
 */
protected function _decode( $string ) {

 $string  = (string) $string;
 $length  = strlen($string);
 $sequence = array();

 for ( $i=0; $i<$length; ) {
  $bytes  = $this->_characterBytes($string, $i);
  $ord  = $this->_ord($string, $bytes, $i);

  if ( $ord !== false )
   $sequence[] = $ord;

  if ( $bytes === false )
   $i++;
  else
   $i += $bytes;
 }

 return $sequence;

}

/**
 * Returns the Unicode code point of the character that starts at byte offset $pos.
 *
 * @see http://en.wikipedia.org/wiki/UTF-8#Description
 * @access protected
 * @param string $string
 * @param integer $bytes
 * @param integer $pos
 * @return integer|false
 */
protected function _ord( &$string, $bytes = null, $pos=0 )
{
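 if ( is_null($bytes) )
  $bytes = $this->_characterBytes($string, $pos);

 // The whole sequence must fit inside the string, counted from $pos.
 if ( $bytes !== false && strlen($string) >= $pos + $bytes ) {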

  switch ( $bytes ) {
   case 1:
    return ord($string[$pos]);
    break;

   case 2:
    return  ( (ord($string[$pos])  & 0x1f) << 6 ) +
            ( (ord($string[$pos+1]) & 0x3f) );
    break;

   case 3:
    return  ( (ord($string[$pos])  & 0xf) << 12 ) + 
      ( (ord($string[$pos+1]) & 0x3f) << 6 ) +
      ( (ord($string[$pos+2]) & 0x3f) );
    break;

   case 4:
    return  ( (ord($string[$pos])  & 0x7)  << 18 ) + 
      ( (ord($string[$pos+1]) & 0x3f) << 12 ) + 
      ( (ord($string[$pos+2]) & 0x3f) << 6 ) +
      ( (ord($string[$pos+3]) & 0x3f) );
    break;

   case 0:
   default:
    return false;
  }
 }

 return false;
}
/**
 * Returns the number of bytes of the character starting at byte offset $position,
 * or false if that byte is not a valid UTF-8 lead byte.
 *
 * @see http://en.wikipedia.org/wiki/UTF-8#Description
 * @access protected
 * @param string $string
 * @param integer $position
 */
protected function _characterBytes( &$string, $position = 0 ) {
 $char   = $string[$position];
 $charVal  = ord($char);

 if ( ($charVal & 0x80) === 0 )
  return 1;

 elseif ( ($charVal & 0xe0) === 0xc0 )
  return 2;

 elseif ( ($charVal & 0xf0) === 0xe0 )
  return 3;

 elseif ( ($charVal & 0xf8) === 0xf0)
  return 4;
 /*
 elseif ( ($charVal & 0xfe) === 0xf8 )
  return 5;
 */

 return false;
}
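
The methods above produce code points, while the question asks for characters. As a rough illustration of the same lead-byte logic applied to that goal, here is a standalone sketch. The function utf8_split_by_lead_byte is hypothetical and not part of Zend_Locale_UTF8; it assumes mostly well-formed UTF-8 and simply emits any stray byte as its own element.

```php
<?php
// Hypothetical standalone helper, not part of Zend_Locale_UTF8.
// Splits a UTF-8 string into characters by inspecting each lead byte.
function utf8_split_by_lead_byte($string)
{
    $chars  = array();
    $length = strlen($string);

    for ($i = 0; $i < $length; ) {
        $lead = ord($string[$i]);

        if (($lead & 0x80) === 0x00) {
            $bytes = 1;            // 0xxxxxxx: ASCII
        } elseif (($lead & 0xe0) === 0xc0) {
            $bytes = 2;            // 110xxxxx: two-byte sequence
        } elseif (($lead & 0xf0) === 0xe0) {
            $bytes = 3;            // 1110xxxx: three-byte sequence
        } elseif (($lead & 0xf8) === 0xf0) {
            $bytes = 4;            // 11110xxx: four-byte sequence
        } else {
            $bytes = 1;            // continuation or invalid byte: step over it
        }

        // substr() copies the bytes of one character; a sequence cut off
        // at the end of the string simply comes back shorter.
        $chars[] = substr($string, $i, $bytes);
        $i += $bytes;
    }

    return $chars;
}

// "a", "é" (2 bytes) and "文" (3 bytes), written as escapes so the
// example does not depend on the encoding of the source file.
print_r(utf8_split_by_lead_byte("a\xC3\xA9\xE6\x96\x87"));
```

Unlike the regex approach, this does not validate continuation bytes, so it is only a sketch for input that is already known to be UTF-8.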
André Hoffmann
A: 

I was able to write a solution using mb_*, including a trip to UTF-16 and back in a probably silly attempt to speed up string indexing:

$japanese2 = mb_convert_encoding($japanese, "UTF-16", "UTF-8");
$length = mb_strlen($japanese2, "UTF-16");
for($i=0; $i<$length; $i++) {
    $char = mb_substr($japanese2, $i, 1, "UTF-16");
    $utf8 = mb_convert_encoding($char, "UTF-8", "UTF-16");
    print $utf8 . "\n";
}

I had better luck avoiding mb_internal_encoding and just specifying everything at each mb_* call. I'm sure I'll wind up using the preg solution.
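
For comparison, a minimal sketch of the same loop without the round trip through UTF-16, passing the encoding explicitly at each call as described above. The name mb_chars is hypothetical, the mbstring extension is assumed to be loaded, and the caller is assumed to know the input encoding.

```php
<?php
// Hypothetical helper; requires the mbstring extension.
// Splits $text into characters according to the given $encoding.
function mb_chars($text, $encoding)
{
    $chars  = array();
    // Passing $encoding on every call avoids relying on mb_internal_encoding().
    $length = mb_strlen($text, $encoding);

    for ($i = 0; $i < $length; $i++) {
        $chars[] = mb_substr($text, $i, 1, $encoding);
    }

    return $chars;
}

print_r(mb_chars("caf\xC3\xA9", "UTF-8"));       // last element is the 2-byte "é"
print_r(mb_chars("caf\xE9", "ISO-8859-1"));      // last element is the single byte 0xE9
```

Because the encoding is a parameter, this also covers input that is not UTF-8, as long as the encoding is known in advance. Each mb_substr call may have to scan from the start of the string, so for long strings the preg solution is likely the faster choice.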

joeforker