Is there a way to detect the language of the data being entered via the input field?
I'm not aware of a PHP solution for this, no.
The Google Translate Ajax APIs may be for you, though.
Check out this JavaScript snippet from the API docs: Example: Language Detection
You can use this function, which I have written for you:
<?php
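/**
* Returns true if the string contains only Arabic letters.
*
* @param string $string
* @return bool
*/
function is_arabic($string)
{
    // The /u modifier is required so that \p{Arabic} matches UTF-8 characters;
    // + and $ make the pattern cover the whole string, not just its first character.
    return (preg_match('/^\p{Arabic}+$/u', $string) > 0);
}
Please test it before you use it.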
[EDIT 1]
Your question was: "How do I detect if an input string is Arabic?" I have answered it, so what's wrong?
[EDIT 2]
Read this - http://stackoverflow.com/questions/1441562/detect-language-from-string-in-php
[EDIT 3]
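Sorry, I have rewritten the function as follows, try it:
function is_arabic($string)
{
    // Single quotes keep PHP from interpreting \x itself; \x{0600} with /u is a code point, not a byte.
    return (preg_match('/^[\x{0600}-\x{06FF}]+$/u', $string) > 0);
}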
I assume you're referring to a Unicode string... in which case, just look for the presence of any character with a code between U+0600–U+06FF (1536–1791) in the string.
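A minimal sketch of that check, assuming the input is valid UTF-8. The helper name contains_arabic() is made up for illustration; it only tests for the presence of one character in the range and says nothing about the rest of the string.

```php
<?php
// Hypothetical helper (not a PHP built-in): true if $string contains
// at least one code point in the range U+0600..U+06FF.
function contains_arabic(string $string): bool
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8,
    // so \x{0600} refers to a code point rather than a single byte.
    return preg_match('/[\x{0600}-\x{06FF}]/u', $string) === 1;
}

var_dump(contains_arabic('hello عربية'));  // bool(true)
var_dump(contains_arabic('hello world'));  // bool(false)
```

If the input is not valid UTF-8, preg_match() returns false and the helper reports false as well.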
Hmm, may I offer an improved version of DimaKrasun's function:
function is_arabic($string) {
if($string === 'arabic') {
return true;
}
return false;
}
Okay, enough joking!
Pekka's suggestion to use the Google Translate API is a good one, but you would be relying on an external service, which always adds complexity.
I think Rushyo's approach is good, it's just not that easy. I wrote the following function for you. It is untested, but it should work...
<?php
function uniord($u) {
// I just copied this function from the php.net comments, but it should work fine!
$k = mb_convert_encoding($u, 'UCS-2LE', 'UTF-8');
$k1 = ord(substr($k, 0, 1));
$k2 = ord(substr($k, 1, 1));
return $k2 * 256 + $k1;
}
function is_arabic($str) {
if(mb_detect_encoding($str) !== 'UTF-8') {
// mb_convert_encoding() takes the target encoding first, then the source encoding
$str = mb_convert_encoding($str, 'UTF-8', mb_detect_encoding($str));
}
/*
$str = str_split($str); <- this function is not multibyte-safe: it splits by bytes, not characters, so we cannot use it
$str = preg_split('//u',$str); <- this would probably work fine, but a bug was reported in some PHP versions where it splits by bytes instead of characters as well
*/
preg_match_all('/.|\n/u', $str, $matches);
$chars = $matches[0];
$arabic_count = 0;
$latin_count = 0;
$total_count = 0;
foreach($chars as $char) {
//$pos = ord($char); we can't use that: ord() only reads the first byte, so it is not multibyte-safe
$pos = uniord($char);
echo $char ." --> ".$pos.PHP_EOL;
if($pos >= 1536 && $pos <= 1791) {
$arabic_count++;
} else if($pos >= 65 && $pos <= 122) { // rough placeholder range for basic Latin letters (A-z); the original bounds could never match
$latin_count++;
}
$total_count++;
}
if(($arabic_count/$total_count) > 0.6) {
// 60% arabic chars, its probably arabic
return true;
}
return false;
}
$arabic = is_arabic('عربية إخبارية تعمل على مدار اليوم. يمكنك مشاهدة بث القناة من خلال الموقع');
var_dump($arabic);
?>
Final thoughts: as you can see, I added a Latin counter as an example. Its range is just a placeholder, but this way you could detect other character sets (Hebrew, Latin, Arabic, Hindi, Chinese, etc.).
You may also want to eliminate some characters first, maybe @, spaces, line breaks, slashes, etc. The PREG_SPLIT_NO_EMPTY flag for preg_split() would be useful, but because of the bug I didn't use it here.
You can also keep a counter for each character set and see which one occurs the most.
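One way to sketch the "counter per character set" idea is to let PCRE's Unicode script properties do the range checks. This is untested beyond a few simple strings; the function name dominant_script() and the list of scripts are my own choices for illustration, not part of the code above.

```php
<?php
// Hypothetical helper: returns the name of the most frequent script in $str,
// or null when none of the listed scripts occurs at all.
function dominant_script(string $str): ?string
{
    // Unicode script names as understood by PCRE's \p{...} syntax.
    $scripts = ['Arabic', 'Hebrew', 'Latin', 'Devanagari', 'Han'];

    $best = null;
    $bestCount = 0;
    foreach ($scripts as $script) {
        // preg_match_all() returns the number of matches, or false on error
        // (for example invalid UTF-8), which the cast turns into 0.
        $count = (int) preg_match_all('/\p{' . $script . '}/u', $str);
        if ($count > $bestCount) {
            $best = $script;
            $bestCount = $count;
        }
    }
    return $best;
}

var_dump(dominant_script('عربية abc'));  // string(6) "Arabic"
var_dump(dominant_script('123 !?'));     // NULL
```

On a tie, the script listed first wins, so the order of the array matters.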
Finally, you should consider cutting your string off after 200 characters or so. This should be enough to tell which character set is used.
And you have to do some error handling, such as guarding against division by zero, empty strings, etc. Don't forget that, please! Any questions? Comment!
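The clean-up, truncation and error-handling points above could be combined roughly like this. It is only a sketch: is_mostly_arabic(), its parameter names and the choice to keep letters only are assumptions of mine, while the 0.6 threshold and the 200-character cut-off are the values mentioned in this answer.

```php
<?php
// Hypothetical variant of is_arabic() with the safeguards discussed above.
function is_mostly_arabic(string $str, float $threshold = 0.6, int $maxChars = 200): bool
{
    // Keep letters only: drops spaces, line breaks, digits, @, slashes, etc.
    $letters = preg_replace('/\P{L}+/u', '', $str);
    if ($letters === null || $letters === '') {
        return false; // invalid UTF-8 or nothing left to classify
    }

    // Look at the first $maxChars letters only.
    preg_match('/^.{0,' . $maxChars . '}/su', $letters, $m);
    $sample = $m[0];

    $total = (int) preg_match_all('/./su', $sample);
    if ($total === 0) {
        return false; // avoids division by zero (e.g. $maxChars = 0)
    }
    $arabic = (int) preg_match_all('/[\x{0600}-\x{06FF}]/u', $sample);

    return ($arabic / $total) > $threshold;
}

var_dump(is_mostly_arabic('عربية إخبارية تعمل على مدار اليوم.')); // bool(true)
var_dump(is_mostly_arabic(''));                                    // bool(false)
```

Filtering with \P{L} also removes Arabic diacritics and digits, so the ratio is computed over letters only; whether that is what you want depends on your input.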
If you want to detect the LANGUAGE of a string, you should split it into words and look those words up in predefined tables. You don't need a complete dictionary; the most common words are enough and it should work fine. Tokenization/normalization is a must as well! There are libraries for that anyway, and this is not what you asked for :) I just wanted to mention it.
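Just to illustrate the word-table idea, here is a toy sketch. The tables are tiny and hand-picked by me purely as an example, and guess_language() is a made-up name; a real table would need many more frequent words per language, and a dedicated library will do much better.

```php
<?php
// Tiny illustrative tables of frequent words (assumed, far from complete).
const COMMON_WORDS = [
    'en' => ['the', 'and', 'is', 'of', 'to', 'in'],
    'de' => ['der', 'die', 'und', 'ist', 'das', 'nicht'],
    'fr' => ['le', 'la', 'et', 'est', 'les', 'des'],
];

// Hypothetical helper: returns the language code whose table matches the
// most words in $text, or null if no word matches any table.
function guess_language(string $text): ?string
{
    // Normalization: lower-case (multibyte-aware when mbstring is available).
    $lower = function_exists('mb_strtolower')
        ? mb_strtolower($text, 'UTF-8')
        : strtolower($text);

    // Tokenization: split on every run of non-letter characters.
    $words = preg_split('/\P{L}+/u', $lower, -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        return null; // e.g. invalid UTF-8
    }

    $best = null;
    $bestScore = 0;
    foreach (COMMON_WORDS as $lang => $table) {
        // array_intersect() keeps every occurrence from $words found in $table.
        $score = count(array_intersect($words, $table));
        if ($score > $bestScore) {
            $best = $lang;
            $bestScore = $score;
        }
    }
    return $best;
}

var_dump(guess_language('The cat is in the garden and the dog is too')); // string(2) "en"
```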