In PHP, is there a way to detect whether a string is English or non-English? Assume the string is UTF-8 encoded.
You cannot detect the language from the character type alone, and there is no foolproof way to do this.
With any method, you are only making an educated guess. There are some math-heavy articles on the subject out there.
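To illustrate why character type alone is not enough, here is a minimal sketch, assuming the PCRE Unicode script properties (`\p{Latin}`) are available in your PHP build. The function name `usesOnlyLatinLetters` is made up for this example. It can only rule English out (for instance, when Cyrillic letters appear); it cannot confirm English, because Spanish or French text passes the same check.

```php
<?php
// Hypothetical helper: true if every letter in $text is from the Latin script.
function usesOnlyLatinLetters(string $text): bool
{
    // Look for any letter (\p{L}) that is NOT in the Latin script.
    // preg_match returns 0 when no such letter exists.
    return preg_match('/(?!\p{Latin})\p{L}/u', $text) === 0;
}

var_dump(usesOnlyLatinLetters('Hello world'));          // Latin letters only
var_dump(usesOnlyLatinLetters('Привет'));               // Cyrillic letters
var_dump(usesOnlyLatinLetters('¿Dónde está el baño?')); // Latin script, but Spanish
```

The third call is the important one: the text is not English, yet it still passes a script check.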
One approach might be to break the input string into words and then look up those words in an English dictionary to see how many of them are present. This approach has a few limitations:
- proper nouns may not be handled well
- spelling errors can disrupt your lookups
- abbreviations like "lol" or "b4" won't necessarily be in the dictionary
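The dictionary approach could be sketched roughly as follows. This is only an illustration: `englishWordRatio` is a hypothetical helper, the word list is a toy stand-in for a real dictionary, the 0.5 threshold is arbitrary, and the mbstring extension is assumed to be installed.

```php
<?php
// Hypothetical helper: fraction of words in $text that appear in $dictionary.
// $dictionary is an array whose KEYS are lowercase English words.
function englishWordRatio(string $text, array $dictionary): float
{
    // Split on anything that is not a Unicode letter or an apostrophe.
    $words = preg_split(
        "/[^\\p{L}']+/u",
        mb_strtolower($text, 'UTF-8'),
        -1,
        PREG_SPLIT_NO_EMPTY
    );
    if (!$words) {
        return 0.0; // no words, or the input was not valid UTF-8
    }

    $hits = 0;
    foreach ($words as $word) {
        if (isset($dictionary[$word])) {
            $hits++;
        }
    }
    return $hits / count($words);
}

// Toy word list for illustration; a real one would be loaded from a file.
$dictionary = array_flip(['the', 'is', 'where', 'bathroom', 'a', 'of', 'and']);

$ratio = englishWordRatio('Where is the bathroom?', $dictionary);
echo $ratio >= 0.5 ? "probably English\n" : "probably not English\n";
```

Using a ratio rather than requiring every word to match leaves some room for the proper nouns, misspellings, and abbreviations listed above.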
You can probably use the Google Translate API to detect the language and translate it if necessary.
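You could do this entirely client-side with Google's AJAX Language API. Note that Google has since deprecated this API, so check that it is still available before relying on it.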
With the AJAX Language API, you can translate and detect the language of blocks of text within a webpage using only Javascript. In addition, you can enable transliteration on any textfield or textarea in your web page. For example, if you were transliterating to Hindi, this API will allow users to phonetically spell out Hindi words using English and have them appear in the Hindi script.
You can automatically detect a string's language:
    var text = "¿Dónde está el baño?";
    google.language.detect(text, function(result) {
        if (!result.error) {
            var language = 'unknown';
            // Map the returned language code back to its readable name.
            for (var l in google.language.Languages) {
                if (google.language.Languages[l] == result.language) {
                    language = l;
                    break;
                }
            }
            var container = document.getElementById("detection");
            container.innerHTML = text + " is: " + language;
        }
    });
And you can translate any string written in one of the supported languages:
    google.language.translate("Hello world", "en", "es", function(result) {
        if (!result.error) {
            var container = document.getElementById("translation");
            container.innerHTML = result.translation;
        }
    });
Perhaps submit the string to this language guesser:
http://www.xrce.xerox.com/competencies/content-analysis/tools/guesser
I would take documents in various languages and tabulate which Unicode characters each of them uses. You could then use some Bayesian reasoning to determine the language just from the Unicode characters used. This would separate French from English or Russian.
I am not sure what else could be done, except looking up the words in language dictionaries to determine the language (using a similar probabilistic approach).
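A rough sketch of the character-based Bayesian idea, assuming a uniform prior over languages and add-one smoothing. Neither of those is specified in the answer; they are illustrative choices. The function names and the tiny training samples are made up, and a usable model would need far more text per language.

```php
<?php
// Count how often each letter occurs in $text (case-insensitive).
function charCounts(string $text): array
{
    $chars = preg_split('//u', mb_strtolower($text, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $counts = [];
    foreach ($chars as $ch) {
        if (preg_match('/\p{L}/u', $ch)) {
            $counts[$ch] = ($counts[$ch] ?? 0) + 1;
        }
    }
    return $counts;
}

// Log-likelihood of $text under one language's character counts,
// using add-one smoothing so unseen characters do not give log(0).
function logLikelihood(string $text, array $model, int $alphabetSize): float
{
    $total = array_sum($model);
    $score = 0.0;
    foreach (charCounts($text) as $ch => $n) {
        $p = (($model[$ch] ?? 0) + 1) / ($total + $alphabetSize);
        $score += $n * log($p);
    }
    return $score;
}

// Pick the language whose character model makes $text most likely.
function guessLanguage(string $text, array $samples): string
{
    $models = array_map('charCounts', $samples); // keys (language names) are kept
    $alphabet = [];
    foreach ($models as $model) {
        $alphabet += $model; // union of all characters seen in training
    }
    $best = '';
    $bestScore = -INF;
    foreach ($models as $lang => $model) {
        $score = logLikelihood($text, $model, count($alphabet));
        if ($score > $bestScore) {
            $best = $lang;
            $bestScore = $score;
        }
    }
    return $best;
}

// Toy training samples for illustration only.
$samples = [
    'english' => 'the quick brown fox jumps over the lazy dog and then asks where the bathroom is',
    'french'  => "où est la salle de bain s'il vous plaît le garçon a mangé une pomme à l'école",
    'russian' => 'где находится ванная комната быстрая коричневая лиса',
];

echo guessLanguage('привет мир', $samples), "\n";
```

Languages with different scripts separate easily this way. Telling English from French on character counts alone needs much larger samples, or character n-grams instead of single characters.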
I've used the Text_LanguageDetect PEAR package with reasonable results. It's dead simple to use, and it has a modest 52-language database. The downside is that it does not detect East Asian languages.
    require_once 'Text/LanguageDetect.php';

    $l = new Text_LanguageDetect();
    // Ask for the 4 most likely languages for $text.
    $result = $l->detect($text, 4);
    if (PEAR::isError($result)) {
        echo $result->getMessage();
    } else {
        print_r($result);
    }
This results in:

    Array
    (
        [german] => 0.407037037037
        [dutch] => 0.288065843621
        [english] => 0.283333333333
        [danish] => 0.234526748971
    )
You can also use the API of the Lang ID service: http://langid.net/identify-language-from-api.html