When I run this code on some Chinese text, the ni (你) character (and maybe others) gets chopped off and broken.

$sample = "你不喜欢 香蕉 吗";
$parts = preg_split("/[\s,]+/", $sample);
var_dump($parts);

//outputs
array(4) {
  [0]=>
  string(2) "�"
  [1]=>
  string(9) "不喜欢"
  [2]=>
  string(6) "香蕉"
  [3]=>
  string(3) "吗"
}

//in 我觉得 你很 麻烦
//out
array(4) {
  [0]=>
  string(9) "我觉得"
  [1]=>
  string(2) "�"
  [2]=>
  string(3) "很"
  [3]=>
  string(6) "麻烦"
}

Is my regex wrong?

A: 

Since the input string is multi-byte, I guess you'll have to use mb_split in place of preg_split.

codaddict
If I use mb_split I only get `string(25) "我觉得 你很 麻烦"` as output (double space?)
Moak
@Moak With mb_split you must not wrap the pattern in delimiters. Modifiers are not part of the pattern either; they are set separately (for example with mb_regex_set_options()).
Artefacto
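
To make the comment above concrete, here is a minimal sketch of the mb_split variant. It assumes the mbstring extension is loaded and that the input is UTF-8; the call to mb_regex_encoding() is my own addition to make that assumption explicit.

```php
<?php
// Assumption: mbstring is available and the subject string is UTF-8.
mb_regex_encoding('UTF-8');

$sample = "我觉得 你很 麻烦";

// mb_split takes the bare pattern: no surrounding slashes, no trailing modifiers.
$parts = mb_split('[\s,]+', $sample);
var_dump($parts);
```

With slashes around the pattern, mb_split looks for literal `/` characters, finds none, and returns the whole string as a single element, which matches what Moak saw.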
+4  A: 

If your string is in UTF-8, you must use the u modifier:

$sample = "你不喜欢 香蕉 吗";
$parts = preg_split("/[\\s,]+/u", $sample);
var_dump($parts);

If it's in another encoding, see codaddict's answer.
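
A likely explanation for the broken character, sketched here rather than confirmed by the question: without the u modifier the pattern is matched byte by byte, and the UTF-8 encoding of 你 ends in the byte 0xA0, which some PCRE character tables treat as whitespace (it is the non-breaking space in Latin-1). The helper name split_utf8_words below is hypothetical and only wraps the call shown above.

```php
<?php
// Hypothetical helper (not from the original answer): split on whitespace
// and commas, treating the subject as UTF-8 code points instead of bytes.
function split_utf8_words(string $text): array
{
    // PREG_SPLIT_NO_EMPTY drops empty pieces caused by leading/trailing separators.
    $parts = preg_split('/[\s,]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split returns false on failure, e.g. when $text is not valid UTF-8.
    return $parts === false ? [] : $parts;
}

// The three UTF-8 bytes of 你; the last one is 0xA0.
echo bin2hex("你"), "\n"; // e4bda0

var_dump(split_utf8_words("你不喜欢 香蕉 吗"));
```

That would also explain the `string(2)` in the question's output: the first two bytes of 你 are left behind once the third is consumed as a separator.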

Artefacto
`非常好 (very good), cheers =)`
Moak