Some time in the near future I will need to implement a cross-language word count, or if that is not possible, a cross-language character count.

By word count I mean an accurate count of the words contained within the given text, taking the language of the text into account. The language of the text is set by the user and will be assumed to be correct.

By character count I mean a count of the "possibly in a word" characters contained within the given text, with the same language information described above.

I would much prefer the former count if at all possible, though I am aware of the difficulties involved and that the latter count is much easier.

I'd love it if I just had to look at English, but I need to consider every language here: Chinese, Korean, English, Arabic, Hindi, and so on.

I would like to know if Stack Overflow has any leads on where to start looking for an existing product / method to do this in PHP, as I am a good lazy programmer*

A simple test showing how str_word_count with setlocale doesn't work, and a function from php.net's str_word_count page.

*http://blogoscoped.com/archive/2005-08-24-n14.html

A: 

Well, try:

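<?php
// Count the space-separated tokens that contain at least one Latin letter or digit.
function count_words($str) {
    $words = 0;
    // Collapse runs of spaces (the ereg* functions were removed in PHP 7, so use preg_*).
    $str = preg_replace('/ +/', ' ', $str);
    foreach (explode(' ', $str) as $token) {
        // The u modifier makes the accented ranges match UTF-8 code points.
        if (preg_match('/[0-9A-Za-zÀ-ÖØ-öø-ÿ]/u', $token)) {
            $words++;
        }
    }
    return $words;
}
echo count_words('This is the second one , it will count wrong as well" , it will count 12 instead of 11 because the comma is counted too.');
?>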
Zuul
Doesn't work at all for Chinese unfortunately.
deceze
I'm from Portugal and it's 6 AM here... I haven't slept yet... but afterwards I can adapt it to Chinese and whatever other language... :)
Zuul
Chinese, Korean, Japanese (...) don't use " ".
Michael Robinson
+2  A: 

Counting chars is easy:

echo strlen('一个有十的字符的句子'); // 30 (WRONG! strlen counts bytes, not characters)
echo strlen(utf8_decode('一个有十的字符的句子')); // 10 (note: utf8_decode is deprecated as of PHP 8.2)
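
If the fallback is the "possibly in a word" character count from the question, one way to approximate it is to count only letters and digits with PCRE's Unicode properties. This is a minimal sketch, not a definitive implementation: `count_word_chars` is a hypothetical helper name, the input is assumed to be valid UTF-8, and treating `\p{L}` and `\p{N}` as "word characters" is my assumption (combining marks, as used in Hindi, are deliberately left out here).

```php
<?php
// Hypothetical helper: counts code points that are letters or digits.
// Assumes $text is valid UTF-8.
function count_word_chars(string $text): int
{
    // \p{L} = any letter, \p{N} = any number; the u modifier enables UTF-8 mode.
    $n = preg_match_all('/[\p{L}\p{N}]/u', $text);

    // preg_match_all returns false on error (for example, invalid UTF-8).
    return $n === false ? 0 : $n;
}

echo count_word_chars('一个有十的字符的句子'), "\n"; // 10
echo count_word_chars("Mike's Red Car"), "\n";     // 11 (apostrophe and spaces are skipped)
```

Unlike the `strlen` variants, this ignores spaces and punctuation, so it is closer to the count described in the question than to a raw string length.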

Counting words is where things start to get tricky, especially for Chinese, Japanese and other languages that don't use spaces (or other common "word boundary" characters) as word separators. I don't speak Chinese and I don't understand how word counting works in Chinese, so you'll have to educate me a bit: what makes a word in these languages? Is it any specific char or set of chars? I remember reading something about how hard it was to identify Japanese words in T9 writing, but I can't find it anymore.

The following should correctly return the number of words in languages that use spaces or punctuation chars as word separators:

count(preg_split('~[\p{Z}\p{P}]+~u', $string, -1, PREG_SPLIT_NO_EMPTY));
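
To make the limits of the split approach concrete, here is a small sketch that wraps it in a function and runs it on a few inputs. `count_separated_words` is a hypothetical name, and adding `\s` to the character class (so tabs and newlines also separate words) is my own adjustment, not part of the one-liner above.

```php
<?php
// Hypothetical helper: counts the chunks left after splitting on
// Unicode separators, punctuation and whitespace. Assumes valid UTF-8.
function count_separated_words(string $text): int
{
    $parts = preg_split('~[\p{Z}\p{P}\s]+~u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split returns false on error (for example, invalid UTF-8).
    return $parts === false ? 0 : count($parts);
}

echo count_separated_words('Hello, world!'), "\n";        // 2
echo count_separated_words("Mike's Red Car"), "\n";       // 4 (the apostrophe splits "Mike's")
echo count_separated_words('一个有十的字符的句子'), "\n"; // 1 (no separators at all)
```

The last two lines show the weak spots: any punctuation inside a word splits it, and text without separators collapses to a single "word".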
Alix Axel
I don't know Korean or Japanese, but the thought of designing a system to count Chinese words makes me shudder. There are no specific word boundaries, and it's debatable as to whether some chars should even be counted as "words" at all... "的" for example is often equivalent to the possessive "'" in English, as in Mike的 Red Car (Mike's Red Car)... I give up, we're going to solve this issue another way (thank ***k for bosses who LISTEN to their lead devs...). To answer "what makes a word in Chinese": nothing!
Michael Robinson
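
For anyone who still wants a lead on an existing method: PHP's intl extension exposes ICU's word break iterator, which applies Unicode word-boundary rules and uses a dictionary for scripts such as Chinese and Japanese. The following is only a sketch under stated assumptions: `count_words_icu` is a hypothetical helper name, the intl extension must be installed, and what ICU's dictionary treats as a "word" in Chinese will not settle the linguistic debate described above.

```php
<?php
// Hypothetical helper: counts the segments that ICU tags as words or numbers.
// Requires the intl extension; $locale is the user-supplied language, e.g. 'zh' or 'en'.
function count_words_icu(string $text, string $locale): int
{
    $it = IntlBreakIterator::createWordInstance($locale);
    $it->setText($text);

    $count = 0;
    $it->first();
    while ($it->next() !== IntlBreakIterator::DONE) {
        // Segments with a rule status below WORD_NONE_LIMIT are spaces or punctuation.
        if ($it->getRuleStatus() >= IntlBreakIterator::WORD_NONE_LIMIT) {
            $count++;
        }
    }
    return $count;
}

// Only run the demo when the intl extension is available.
if (extension_loaded('intl')) {
    echo count_words_icu('Hello, world!', 'en'), "\n"; // 2
    echo count_words_icu('一个有十的字符的句子', 'zh'), "\n"; // result depends on the ICU dictionary
}
```

The Chinese result depends on the ICU version and its dictionary, so it is best treated as an approximation to be checked against text you can verify by hand.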