From a MySQL database I can extract the following UTF-8 characters:

"唐犬土用家犬尨犬山桑山犬巴戦師子幻日幻月引綱忠犬愛犬戌年成犬教条教義"

I am trying to find the most frequent character in this string. I tried putting each character into an array `$arr` and calling `array_count_values($arr)`. Unfortunately, the array operations (or is `print_r` the culprit?) produce mis-encoded output like `[0] => � [1] => � [2] => � [3] => �`. I can display the characters fine in other situations (i.e. retrieving them from MySQL and displaying them within PHP works fine!), but the array functions (or the array output) seem to mess things up.

I HAVE changed `/etc/php5/apache2/php.ini`
and put `default_charset = "utf-8"` in there.

(And I HAVE issued `SET NAMES ...`, etc.)

A) Where is the problem? B) Could I do the job without resorting to arrays altogether (i.e. just using string functions)?

Thanks for your help.

A: 

How are you turning the string into an array? PHP is not multibyte-safe by default, so it's probably splitting multibyte characters down the middle.

Remember that in UTF-8, characters are of variable length: some are one byte, some are two, three, or four. (The original design allowed up to six bytes, but UTF-8 has since been restricted to four.) You would need a string-split algorithm clever enough to know when two or three bytes form a single character, and to leave them together.
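A quick illustration of the mismatch, assuming the mbstring extension is loaded:

```php
$s = "犬";                      // one kanji character (3 bytes in UTF-8)
echo strlen($s), "\n";          // 3 — strlen() counts bytes
echo mb_strlen($s, 'UTF-8');    // 1 — mb_strlen() counts characters
```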

In the absence of such an algorithm, the simplest solution might be to convert your string to UTF-32. There, every character is exactly four bytes long, so you can split on every four bytes (to the simplistic PHP string functions, that means every four characters, because PHP thinks a byte is a character).
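A minimal sketch of that approach, assuming mbstring is available and `$text` (a stand-in name here) holds your UTF-8 string:

```php
// Convert to UTF-32BE (big-endian, no BOM) so each character is exactly 4 bytes.
$utf32 = mb_convert_encoding($text, 'UTF-32BE', 'UTF-8');

// str_split() is byte-based, which is exactly what we want here:
// every 4-byte chunk is now one whole character.
$chars = str_split($utf32, 4);

// Count occurrences and pick the most frequent character.
$counts = array_count_values($chars);
arsort($counts);
$mostFrequent = key($counts);

// Convert back to UTF-8 for display.
echo mb_convert_encoding($mostFrequent, 'UTF-8', 'UTF-32BE');
```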

Combining diacritics might still be a problem, though (two Unicode code points making up one printable character). But at least you wouldn't get broken Unicode. You might get COMBINING DIAERESIS on its own, but that's not very broken. I'm not sure to what extent combining characters apply to East Asian languages; I'm not a Unicode expert.
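If combining marks did turn up, one option would be to normalise to NFC first, which folds base characters and combining marks into precomposed code points where possible. A minimal sketch, assuming the intl extension is installed and `$text` holds your UTF-8 string:

```php
// NFC composes a base character plus combining marks into a single
// precomposed code point wherever one exists.
$text = Normalizer::normalize($text, Normalizer::FORM_C);
```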

TRiG
Here is how I created the array:

```php
while ($r = mysql_fetch_array($result) AND $i < 10) {
    $text .= $r['japanese'];
    $i++;
}
$kanji = array();
for ($i = 0; $i < strlen($text); $i++) {
    $kanji[$i] = $text[$i];
}
```

Out of a 10-character Unicode string this creates a 19-element array, but as you say, the contents are doubtful... I wonder whether it would be possible to do the whole evaluation in MySQL?
ajo
`strlen()` is counting bytes, not characters. And I can't find documentation for the `$text[$i]` construction, but I'm sure it's doing the same. The comments on http://php.net/manual/en/function.str-split.php give some tips for Unicode strings.
TRiG
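One widely suggested tip of that sort, sketched here on the assumption that it matches those comments (`$text` again stands in for the UTF-8 string): split with a Unicode-aware regex and count the pieces.

```php
// The /u modifier makes PCRE treat the subject as UTF-8, so the empty
// pattern splits between characters rather than between bytes.
$chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
print_r(array_count_values($chars));
```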
$text is my string variable, and `$text[$i]` gives me the $i-th character at a time (which works with ASCII characters...); I'll try your links (and try doing it in MySQL...)
ajo
This did work in MySQL by looping through all `$words`, but it is very slow:

```php
$sql = "SELECT count(japanese) FROM (" .
       "  SELECT japanese FROM edict" .
       "  WHERE english LIKE '%" . $q . "%'" .
       "  ORDER BY length(japanese)" .
       "  LIMIT 0,50) AS tbl1" .
       " WHERE japanese LIKE '%" . $words[$i] . "%'";
```

But better see the next comment:
ajo
Your solution actually works for me now: all the Japanese characters seem to be 3 bytes per character, therefore:

```php
$k = count($words);
for ($i = 0; $i < $k - 1; $i++) {
    if (strlen($words[$i]) > 3) break;
    $freq[$i] = 0;
    for ($j = $i + 1; $j < $k; $j++) {
        // strpos() returns 0 for a match at the start, so compare against false
        if (strpos($words[$j], $words[$i]) !== false) {
            $freq[$i]++;
        }
    }
    echo $words[$i] . "=" . $freq[$i] . "<br>";
}
```

Come to think of it, separating the words into array elements in the first place eliminates the problem of incorrect splitting, and `strpos` works then anyway... (but still useful to know if the input is a line of text)
ajo
Glad you got something working!
TRiG
Actually, since there were some 'shorter' Unicode characters, I'm now using your UTF-32 conversion suggestion, with very reliable results:

```php
$jp32 = mb_convert_encoding($jp, "UTF-32", "UTF-8");
$arr  = str_split($jp32, 4);
foreach ($arr as $w) {
    $w = mb_convert_encoding($w, "UTF-8", "UTF-32");
    echo $w;
}
```
ajo