What code snippets are out there for detecting the language of a chunk of UTF-8 text? I need to filter a large amount of spam that happens to be in Chinese and Arabic. There's a PECL extension for that, but I want to do this purely in PHP code. I guess I need to loop through the string with a Unicode-aware version of ord() and then build some kind of range table for the different languages.
+2
A:
You could translate the UTF-8 string into its Unicode code points and look for “suspicious ranges”.
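function utf8ToUnicode($utf8)
{
    if (!is_string($utf8)) {
        return false;
    }
    $unicode = array();   // decoded code points
    $mbbytes = array();   // bytes of the multi-byte sequence being collected
    $mblength = 1;        // expected length of that sequence
    $strlen = strlen($utf8);
    for ($i = 0; $i < $strlen; $i++) {
        $byte = ord($utf8[$i]);
        if ($byte < 128) {
            // single-byte (ASCII) character
            $unicode[] = $byte;
        } else {
            if (count($mbbytes) == 0) {
                // lead byte: 110xxxxx = 2 bytes, 1110xxxx = 3 bytes, 11110xxx = 4 bytes
                $mblength = ($byte < 224) ? 2 : (($byte < 240) ? 3 : 4);
            }
            $mbbytes[] = $byte;
            if (count($mbbytes) == $mblength) {
                // combine the payload bits of the lead and continuation bytes
                if ($mblength == 4) {
                    $unicode[] = ($mbbytes[0] & 7) * 262144 + ($mbbytes[1] & 63) * 4096 + ($mbbytes[2] & 63) * 64 + ($mbbytes[3] & 63);
                } elseif ($mblength == 3) {
                    $unicode[] = ($mbbytes[0] & 15) * 4096 + ($mbbytes[1] & 63) * 64 + ($mbbytes[2] & 63);
                } else {
                    $unicode[] = ($mbbytes[0] & 31) * 64 + ($mbbytes[1] & 63);
                }
                $mbbytes = array();
                $mblength = 1;
            }
        }
    }
    // note: the input is assumed to be valid UTF-8; malformed bytes are not detected
    return $unicode;
}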
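The "suspicious ranges" check could then look roughly like the sketch below. The function names scriptRanges() and countByScript() are made up for illustration, and the range table is not exhaustive: it only lists the main Arabic blocks and the two largest CJK ideograph blocks.

```php
<?php
// Illustrative range table (not exhaustive): script => list of [first, last] code points.
function scriptRanges(): array
{
    return [
        // Arabic, Arabic Supplement, Arabic Presentation Forms-A and -B
        'arabic' => [[0x0600, 0x06FF], [0x0750, 0x077F], [0xFB50, 0xFDFF], [0xFE70, 0xFEFF]],
        // CJK Unified Ideographs Extension A, CJK Unified Ideographs
        'cjk'    => [[0x3400, 0x4DBF], [0x4E00, 0x9FFF]],
    ];
}

// Count how many of the given code points fall into each script's ranges.
function countByScript(array $codePoints): array
{
    $table = scriptRanges();
    $counts = array_fill_keys(array_keys($table), 0);
    foreach ($codePoints as $cp) {
        foreach ($table as $script => $ranges) {
            foreach ($ranges as [$first, $last]) {
                if ($cp >= $first && $cp <= $last) {
                    $counts[$script]++;
                    break 2; // matched: move on to the next code point
                }
            }
        }
    }
    return $counts;
}
```

You would feed it the array returned by utf8ToUnicode() and flag the text as spam once a count (or its share of the total) passes whatever threshold suits your data.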
Gumbo
2009-02-04 18:14:57
A:
The simplest approach is probably to have a dictionary of common words in different languages and then test how many positive matches you get against each language. It's a rather costly (computation-wise) task though.
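A minimal sketch of that idea, assuming you supply your own word lists: guessLanguage() and the tiny dictionaries in the usage note are hypothetical, and real lists would need to be much larger. Lowercasing uses strtolower(), so only ASCII letters are case-folded here.

```php
<?php
// Return the language whose word list matches the most words in $text,
// or null if nothing matches. $dictionaries maps language => list of lowercase words.
function guessLanguage(string $text, array $dictionaries): ?string
{
    // Split on anything that is not a letter; the /u flag treats the input as UTF-8.
    $words = preg_split('/[^\p{L}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        return null; // invalid UTF-8
    }
    $best = null;
    $bestHits = 0;
    foreach ($dictionaries as $language => $commonWords) {
        // array_intersect keeps every word of $words that appears in the list
        $hits = count(array_intersect($words, $commonWords));
        if ($hits > $bestHits) {
            $best = $language;
            $bestHits = $hits;
        }
    }
    return $best;
}
```

For example, with $dictionaries = ['en' => ['the', 'and', 'is', 'of'], 'de' => ['der', 'und', 'ist', 'die']], calling guessLanguage('The cat is on the mat', $dictionaries) picks 'en'. Every word is compared against every list, which is where the cost comes from.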
troelskn
2009-02-04 20:14:35
It does not have to be words; single characters in a certain range are enough to identify Arabic and Chinese.
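One way to sketch this without decoding the string by hand is to let PCRE do the matching, since it knows Unicode script properties such as \p{Han} and \p{Arabic} when the /u modifier is set. This is a different technique from a hand-written range table, and foreignScriptRatio() is a made-up name.

```php
<?php
// Share of the non-whitespace characters in $text that are Han or Arabic script.
function foreignScriptRatio(string $text): float
{
    // The /u flag makes PCRE treat pattern and subject as UTF-8.
    $total = preg_match_all('/\S/u', $text);
    if ($total === false || $total === 0) {
        return 0.0; // invalid UTF-8 or nothing but whitespace
    }
    $suspicious = preg_match_all('/[\p{Han}\p{Arabic}]/u', $text);
    return (int) $suspicious / $total;
}
```

A comment could then be rejected when the ratio is above some cut-off, say 0.3; the cut-off is a guess and would need tuning against real spam.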
deadprogrammer
2009-02-05 11:46:33
+4
A:
Pipe your text through Google's language detection. You can do this via AJAX. Here is the documentation/developer's guide. For example:
<html>
<head>
<script type="text/javascript" src="http://www.google.com/jsapi"></script>
<script type="text/javascript">
google.load("language", "1");
function initialize() {
var text = document.getElementById("text").innerHTML;
google.language.detect(text, function(result) {
if (!result.error && result.language) {
google.language.translate(text, result.language, "en",
function(result) {
var translated = document.getElementById("translation");
if (result.translation) {
translated.innerHTML = result.translation;
}
});
}
});
}
google.setOnLoadCallback(initialize);
</script>
</head>
<body>
<div id="text">你好,很高興見到你。</div>
<div id="translation"></div>
</body>
</html>
cletus
2009-02-04 22:01:55
A:
1) That wasn't PHP. 2) It wasn't a solution. 3) I think deadprogrammer needs a general solution, not just an idea.
STEVER
2010-04-21 07:05:24