What code snippets are out there for detecting the language of a chunk of UTF-8 text? I basically need to filter a large amount of spam that happens to be in Chinese and Arabic. There's a PECL extension for that, but I want to do this purely in PHP code. I guess I need to loop through a Unicode string with a Unicode-aware version of ord() and then create some kind of range table for the different languages.

+2  A: 

You could translate the UTF-8 string into its Unicode code points and look for “suspicious ranges”.

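function utf8ToUnicode($utf8)
{
    // Assumes well-formed UTF-8; malformed input is not detected.
    if (!is_string($utf8)) {
        return false;
    }
    $unicode  = array();  // decoded code points
    $mbbytes  = array();  // bytes of the multi-byte sequence being collected
    $mblength = 1;        // expected length of that sequence
    $strlen   = strlen($utf8);

    for ($i = 0; $i < $strlen; $i++) {
        $byte = ord($utf8[$i]);
        if ($byte < 128) {
            // single-byte (ASCII) character
            $unicode[] = $byte;
        } else {
            if (count($mbbytes) == 0) {
                // lead byte: 110xxxxx = 2 bytes, 1110xxxx = 3 bytes, 11110xxx = 4 bytes
                $mblength = ($byte < 224) ? 2 : (($byte < 240) ? 3 : 4);
            }
            $mbbytes[] = $byte;
            if (count($mbbytes) == $mblength) {
                if ($mblength == 4) {
                    $unicode[] = ($mbbytes[0] & 7) * 262144 + ($mbbytes[1] & 63) * 4096 + ($mbbytes[2] & 63) * 64 + ($mbbytes[3] & 63);
                } elseif ($mblength == 3) {
                    $unicode[] = ($mbbytes[0] & 15) * 4096 + ($mbbytes[1] & 63) * 64 + ($mbbytes[2] & 63);
                } else {
                    $unicode[] = ($mbbytes[0] & 31) * 64 + ($mbbytes[1] & 63);
                }
                $mbbytes  = array();
                $mblength = 1;
            }
        }
    }
    return $unicode;
}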
Gumbo
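
The "suspicious ranges" check suggested in the answer above could look roughly like this. It is a minimal sketch, not part of the original answer: the `SCRIPT_RANGES` table and `countScripts()` are hypothetical names, the table lists only the main Arabic and CJK ideograph blocks rather than every relevant block, and the input is assumed to be an array of code points such as the one `utf8ToUnicode()` returns.

```php
<?php
// Hypothetical range table: a few well-known Unicode blocks, not an exhaustive list.
const SCRIPT_RANGES = [
    'arabic'  => [[0x0600, 0x06FF], [0x0750, 0x077F]],  // Arabic, Arabic Supplement
    'chinese' => [[0x4E00, 0x9FFF], [0x3400, 0x4DBF]],  // CJK Unified Ideographs, Extension A
];

// Count how many code points fall into each script's ranges.
function countScripts(array $codePoints): array
{
    $counts = array_fill_keys(array_keys(SCRIPT_RANGES), 0);
    foreach ($codePoints as $cp) {
        foreach (SCRIPT_RANGES as $script => $ranges) {
            foreach ($ranges as [$lo, $hi]) {
                if ($cp >= $lo && $cp <= $hi) {
                    $counts[$script]++;
                    break;  // one hit per script is enough for this code point
                }
            }
        }
    }
    return $counts;
}

// U+4F60 and U+597D (two Chinese characters) followed by an ASCII "!"
$counts = countScripts([0x4F60, 0x597D, 0x21]);
```

A message could then be flagged when, say, more than half of its code points land in one of these ranges; the exact threshold is a tuning choice.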
A: 

The simplest approach is probably to have a dictionary of common words in different languages and then test how many positive matches you get against each language. It's a rather costly (computation-wise) task though.

troelskn
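
As a rough illustration of the dictionary idea above, the sketch below counts how many words of a text appear in a per-language word list. The word lists and the `scoreLanguages()` name are hypothetical and far too small for real use, the mbstring extension is assumed to be available for lowercasing, and the approach only suits languages that separate words with spaces.

```php
<?php
// Hypothetical, tiny word lists; a real filter would need much larger ones.
$dictionaries = [
    'en' => ['the', 'and', 'is', 'of', 'to'],
    'de' => ['der', 'und', 'ist', 'von', 'zu'],
];

// Return, per language, the number of words in $text found in that language's list.
function scoreLanguages(string $text, array $dictionaries): array
{
    // Split on anything that is not a letter; the "u" flag makes PCRE treat the input as UTF-8.
    $words = preg_split('/[^\p{L}]+/u', mb_strtolower($text, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        $words = [];  // preg_split() fails on invalid UTF-8
    }
    $scores = [];
    foreach ($dictionaries as $lang => $list) {
        // array_intersect() keeps duplicates, so repeated words count every time
        $scores[$lang] = count(array_intersect($words, $list));
    }
    return $scores;
}

$scores = scoreLanguages('The cat is on top of the mat', $dictionaries);
```

The language with the highest score is the best guess; very short texts will often produce ties or all-zero scores.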
It does not have to be words; single characters in a certain range are enough to identify Arabic and Chinese.
deadprogrammer
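
The single-character check described in the comment above can also be written with PCRE's Unicode script properties instead of a hand-made range table. This is only a sketch under assumptions: `arabicOrHanShare()` is a hypothetical name, and it relies on PHP's PCRE being built with Unicode property support so that `\p{Arabic}` and `\p{Han}` are recognized.

```php
<?php
// Share of the letters in $text that belong to the Arabic or Han scripts (0.0 to 1.0).
function arabicOrHanShare(string $text): float
{
    // Count letters only, so digits, spaces and punctuation do not dilute the share.
    $letters = preg_match_all('/\p{L}/u', $text);
    if (!$letters) {
        return 0.0;  // no letters at all, or invalid UTF-8 (preg_match_all() returned false)
    }
    // The lookahead restricts the script match to letters, keeping the share at or below 1.
    $hits = preg_match_all('/(?=\p{L})[\p{Arabic}\p{Han}]/u', $text);
    return $hits / $letters;
}
```

A spam filter might reject a message when this share exceeds some cut-off such as 0.5; the cut-off itself is a guess that would need tuning against real messages.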
+4  A: 

Pipe your text through Google's language detection. You can do this via AJAX. Here is the documentation/developer's guide. For example:

<html>
  <head>
    <script type="text/javascript" src="http://www.google.com/jsapi"></script>
    <script type="text/javascript">

    google.load("language", "1");

    function initialize() {
      var text = document.getElementById("text").innerHTML;
      google.language.detect(text, function(result) {
        if (!result.error && result.language) {
          google.language.translate(text, result.language, "en",
                                    function(result) {
            var translated = document.getElementById("translation");
            if (result.translation) {
              translated.innerHTML = result.translation;
            }
          });
        }
      });
    }
    google.setOnLoadCallback(initialize);

    </script>
  </head>
  <body>
    <div id="text">你好,很高興見到你。</div>
    <div id="translation"></div>
  </body>
</html>
cletus
A: 

1) That wasn't PHP. 2) That wasn't a solution. 3) I think deadprogrammer needs a general solution, not just an idea.

STEVER