ansaurus

Question

A PHP Library / Class to Count Words in Various Languages?

Answer 1

A:

Well, try:

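<?php
// Counts space-separated tokens that contain at least one Latin letter or digit.
// Uses preg_* because eregi() and eregi_replace() were removed in PHP 7.
// Assumes the input string is UTF-8 (required by the /u modifier).
function count_words($str) {
    $words = 0;
    // Collapse runs of spaces so explode() yields no empty tokens between words.
    $str = preg_replace('/ +/', ' ', $str);
    $tokens = explode(' ', $str);
    foreach ($tokens as $token) {
        // A stand-alone comma has no letter or digit, so it is skipped.
        if (preg_match('/[0-9A-Za-zÀ-ÖØ-öø-ÿ]/u', $token)) {
            $words++;
        }
    }
    return $words;
}
echo count_words('This is the second one , it will count wrong as well" , it will count 12 instead of 11 because the comma is counted too.');
?>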

Zuul 2010-05-30 04:56:42

Doesn't work at all for Chinese unfortunately.

deceze 2010-05-30 04:59:07

I'm from Portugal, it's 6 AM here and I haven't slept yet... but afterwards I can adapt it to Chinese and whatever other language... :)

Zuul 2010-05-30 05:19:34

Chinese, Korean, Japanese (...) don't use " ".

Michael Robinson 2010-05-30 07:44:57

Answer 2

+2 A:

Counting chars is easy:

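echo strlen('一个有十的字符的句子'); // 30 (WRONG! strlen() counts bytes, 3 per character here)
echo strlen(utf8_decode('一个有十的字符的句子')); // 10 (utf8_decode() is deprecated as of PHP 8.2)
echo mb_strlen('一个有十的字符的句子', 'UTF-8'); // 10 (requires the mbstring extension)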

Counting words is where things get tricky, especially for Chinese, Japanese and other languages that don't use spaces (or other common "word boundary" characters) as word separators. I don't speak Chinese and I don't understand how word counting works in it, so you'll have to educate me a bit: what makes a word in these languages? Is it a specific char or set of chars? I remember reading something about how hard it was to identify Japanese words in T9 writing, but I can't find it anymore.

The following should correctly return the number of words in languages that use spaces or punctuation chars as word separators:

count(preg_split('~[\p{Z}\p{P}]+~u', $string, -1, PREG_SPLIT_NO_EMPTY)); // -1 = no limit; passing null here is deprecated since PHP 8.1
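
As a minimal sketch, the one-liner above can be wrapped in a function. The name count_words_unicode is hypothetical and not part of the original answer; the sketch assumes UTF-8 input and passes -1 (no limit) as the third argument. The second call shows the limitation discussed in this thread: text without spaces or punctuation counts as a single word.

```php
<?php
// Hypothetical helper (name not from the original answer): counts the pieces
// left after splitting on Unicode separators (\p{Z}) and punctuation (\p{P}).
// Assumes $string is valid UTF-8.
function count_words_unicode(string $string): int
{
    $parts = preg_split('~[\p{Z}\p{P}]+~u', $string, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure, e.g. for invalid UTF-8 input.
    return $parts === false ? 0 : count($parts);
}

echo count_words_unicode('Olá, mundo! Como está?'), "\n";   // 4
echo count_words_unicode('一个有十的字符的句子'), "\n";       // 1
```

Two quirks worth knowing: the apostrophe is punctuation, so "don't" counts as two words, and tabs and line breaks are control characters rather than \p{Z}, so multi-line text probably needs \s added to the character class.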

Alix Axel 2010-06-16 21:04:30

I don't know Korean or Japanese, but the thought of designing a system to count Chinese words makes me shudder. There are no specific word boundaries, and it's debatable as to whether some chars should even be counted as "words" at all... "的" for example is often equivalent to the possessive "'" in English, as in Mike的 Red Car (Mike's Red Car)... I give up, we're going to solve this issue another way (thank ***k for bosses who LISTEN to their lead devs...). To answer "what makes a word in Chinese": nothing!

Michael Robinson 2010-06-17 09:25:08
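
Since there is nothing reliable to split on, one possible workaround is to count every Chinese or Japanese character as one unit and count the rest of the text by spaces and punctuation. The sketch below does that. It is not something proposed in this thread, the name count_words_mixed is hypothetical, and the result measures text length rather than linguistically correct words.

```php
<?php
// Hypothetical helper: counts each Han, Hiragana, or Katakana character as
// one unit, then counts the remaining text by splitting on separators and
// punctuation. Assumes valid UTF-8 input.
function count_words_mixed(string $text): int
{
    $cjk = '[\p{Han}\p{Hiragana}\p{Katakana}]';

    // Number of CJK characters in the text.
    $cjkCount = preg_match_all('~' . $cjk . '~u', $text);
    if ($cjkCount === false) {
        return 0; // e.g. invalid UTF-8
    }

    // Replace CJK characters with spaces so they act as boundaries for the rest.
    $rest = preg_replace('~' . $cjk . '~u', ' ', $text);
    $parts = preg_split('~[\p{Z}\p{P}]+~u', $rest, -1, PREG_SPLIT_NO_EMPTY);

    return $cjkCount + count($parts);
}

echo count_words_mixed('Mike的 Red Car'), "\n";        // 4
echo count_words_mixed('一个有十的字符的句子'), "\n";   // 10
```

Here "的" is counted as a word of its own, which is exactly the debatable case described above, so treat the number as an approximation.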
