I want to process English words and Japanese words differently in this function:

function process_word($word) {
    if ($word is english) {
        // ...
    } else if ($word is japanese) {
        // ...
    }
}

thank you

A: 

English text usually consists only of ASCII characters (or, better said, characters in the ASCII range).
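
For reference, the ASCII range covers code points 0 through 127 (bytes 0x00 to 0x7F). A minimal sketch of such a check, assuming the input is a plain PHP byte string; the helper name `is_ascii` is made up for illustration:

```php
<?php
// Hypothetical helper: true when every byte of $word lies in the ASCII range 0x00-0x7F.
function is_ascii(string $word): bool
{
    // \A and \z anchor the whole string; an empty string also counts as ASCII here.
    return preg_match('/\A[\x00-\x7F]*\z/', $word) === 1;
}

var_dump(is_ascii('hello'));  // bool(true)
var_dump(is_ascii('日本語'));  // bool(false)
```

Digits and punctuation are ASCII too, so this only says a word could be English, not that it is.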

Messa
What is the range? Do you have any links? Thanks.
bn
Although it's fairly easy to identify most words as being either English or Japanese, there are some characters that belong to both character sets. For example, a string containing only numbers should return true for both English and Japanese.
Jin Kim
+2  A: 

You could try Google's Translation API, which has a language detection function: http://code.google.com/apis/ajaxlanguage/documentation/#Detect

Alec
A: 

You can try to convert the charset and check if it succeeds.

Take a look at iconv: http://www.php.net/manual/en/function.iconv.php

If you can convert a string to ISO-8859-1, it might be English; if you can convert it to ISO-2022-JP, it is probably Japanese (I might be wrong about the exact charsets, so you should google for them).
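
A rough sketch of that idea, assuming the input is UTF-8 and that the local iconv implementation reports unconvertible characters as a failure (some implementations substitute a placeholder instead). The function names are made up for illustration:

```php
<?php
// Hypothetical helper: true when the UTF-8 string $text can be converted to $charset.
function converts_to(string $text, string $charset): bool
{
    // iconv() returns false (and raises a notice, silenced here by @) on failure.
    return @iconv('UTF-8', $charset, $text) !== false;
}

// Hypothetical helper: guesses a language label from which conversion succeeds.
function guess_language_by_charset(string $word): string
{
    // Try the narrower charset first, because plain ASCII also converts to ISO-2022-JP.
    if (converts_to($word, 'ISO-8859-1')) {
        return 'english';
    }
    if (converts_to($word, 'ISO-2022-JP')) {
        return 'japanese';
    }
    return 'unknown';
}

echo guess_language_by_charset('hello'), "\n";
echo guess_language_by_charset('日本語'), "\n";
```

Note that ISO-8859-1 also covers French, German, and other Western European text, so "english" here really means "Western European".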

dbemerlin
+1  A: 

Try the mb_detect_encoding function: if the encoding is EUC-JP or UTF-8 / UTF-16, the text may be Japanese; otherwise it is probably English. It works better if you know which encoding each language uses, since the UTF encodings can be used for many languages.
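
A small sketch of what that could look like, assuming the mbstring extension is loaded and the input is either ASCII or UTF-8. The wrapper name `detect_word_encoding` is made up for illustration:

```php
<?php
// Hypothetical wrapper around mb_detect_encoding(); requires the mbstring extension.
function detect_word_encoding(string $word): string
{
    // Strict mode (third argument) rejects encodings in which $word is not valid.
    // ASCII is listed first so that pure-ASCII input is reported as ASCII.
    $encoding = mb_detect_encoding($word, ['ASCII', 'UTF-8'], true);

    return $encoding === false ? 'unknown' : $encoding;
}

echo detect_word_encoding('hello'), "\n";
echo detect_word_encoding('日本語'), "\n";
```

A result of UTF-8 only says the word contains non-ASCII characters; it could just as well be Russian or Greek.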

Benoit
+5  A: 

A quick solution that doesn't need the mbstring extension:

// Note: utf8_decode() is deprecated as of PHP 8.2
if (strlen($str) != strlen(utf8_decode($str))) {
    // $str uses multi-byte chars (isn't English)
} else {
    // $str is ASCII (probably English)
}

Or a modification of the solution provided by @Alexander Konstantinov:

function isKanji($str) {
    return preg_match('/[\x{4E00}-\x{9FBF}]/u', $str) > 0;
}

function isHiragana($str) {
    return preg_match('/[\x{3040}-\x{309F}]/u', $str) > 0;
}

function isKatakana($str) {
    return preg_match('/[\x{30A0}-\x{30FF}]/u', $str) > 0;
}

function isJapanese($str) {
    return isKanji($str) || isHiragana($str) || isKatakana($str);
}
Alix Axel
This leaves out English words that use diacritics. These are not used very often, but it's a tradeoff you should be aware of when making the choice :)
Thomas Winsnes
@Thomas.Winsnes: You mean stuff like `Hai`, `Wa`, `Ka`, `Arigatou` and so on, right?
Alix Axel
@Alix Axel: No, I mean English words like: naïve, café, résumé, soufflé etc.
Thomas Winsnes
@Thomas.Winsnes: Oh, I see. I never understood whether those are considered valid English words or not. Especially "café", which I've never seen or heard in either British or American English.
Alix Axel
I always write naïve with a diæresis, and diæresis with a æ.
Lajla
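
One possible way to accept words such as naïve or café, sketched under the assumption that the input is valid UTF-8 with precomposed (NFC) characters. The helper name `is_latin_word` is made up for illustration, and it matches any Latin-script word, not only English ones:

```php
<?php
// Hypothetical helper: true when $word consists only of Latin-script letters,
// apostrophes, and hyphens. Assumes valid UTF-8 with precomposed characters.
function is_latin_word(string $word): bool
{
    // \p{Latin} matches letters of the Latin script, including accented ones.
    return preg_match('/\A[\p{Latin}\'-]+\z/u', $word) === 1;
}

var_dump(is_latin_word("caf\u{E9}"));  // "café": bool(true)
var_dump(is_latin_word('日本語'));      // bool(false)
```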
+8  A: 

This function checks whether a word contains at least one Japanese character (I found the Unicode ranges for Japanese characters on Wikipedia).

function isJapanese($word) {
    // preg_match() returns 1, 0, or false; compare so the function returns a boolean
    return preg_match('/[\x{4E00}-\x{9FBF}\x{3040}-\x{309F}\x{30A0}-\x{30FF}]/u', $word) === 1;
}
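
A sketch of how this check might be plugged into the function from the question, assuming UTF-8 input. The names `contains_japanese` and `classify_word` are made up for illustration, and the function only returns a label where the real processing would go:

```php
<?php
// Same character ranges as isJapanese(), repeated so the sketch is self-contained.
function contains_japanese(string $word): bool
{
    return preg_match('/[\x{4E00}-\x{9FBF}\x{3040}-\x{309F}\x{30A0}-\x{30FF}]/u', $word) === 1;
}

// Hypothetical stand-in for process_word(): returns a label instead of doing real work.
function classify_word(string $word): string
{
    if (contains_japanese($word)) {
        return 'japanese';
    }
    // Treat non-empty, pure-ASCII input as English.
    if (preg_match('/\A[\x00-\x7F]+\z/', $word) === 1) {
        return 'english';
    }
    return 'other';
}

echo classify_word('hello'), "\n";   // english
echo classify_word('日本語'), "\n";   // japanese
```

Words with diacritics, and any other non-ASCII, non-Japanese text, fall into the 'other' branch here, so the caller has to decide how to handle them.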
Alexander Konstantinov
+1, Way to go, nice one!
Alix Axel
Great idea!
Pekka