Is there a way to detect the language of the data being entered via the input field?

+1  A: 

I'm not aware of a PHP solution for this, no.

The Google Translate Ajax APIs may be for you, though.

Check out this JavaScript snippet from the API docs: Example: Language Detection

Pekka
Script detection is a very different thing from language detection.
Rushyo
@Rushyo well, at the moment, he is asking for *language* detection rather than script.
Pekka
Taken literally, yes, but I doubt that's the intent.
Rushyo
@Rushyo you don't really know that. I can think of a number of legitimate reasons to try and detect *language*
Pekka
In which case, we'd need to know the dialect as well - info not provided.
Rushyo
@Pekka: you're right, it's language-detection. @Rushyo: The reason is so I can decide whether to display it RTL or LTR. Also, most arab speakers don't know what dialect they speak. It's irrelevant in most cases.
gAMBOOKa
@gAMBOOKa in that case, you can take your pick. I like the character range detection approach outlined in the other answers as well, as it doesn't rely on an external service. If this is going to be extended to other languages, though, or if it's likely you encounter difficult (mixed) data, Google's processing algorithms may be superior.
Pekka
@gAMBOOKa That's the script, not the language... hate to labour the point.
Rushyo
@Rushyo: They're interchangeable depending on context. When was the last time you heard someone say English is a script. @Pekka: We were initially using Google's Language Detection API but now our app needs to function without internet availability as well.
gAMBOOKa
"When was the last time you heard someone say English is a script." Never, as Latin is the script. English is the language. And the term Latin script is used all the time, esp. in computing! Localisation is basically impossible without understanding that distinction.
Rushyo
@gAMBOOKa Also, the idea that 'most arab speakers don't know what dialect they speak' is nonsense. Different dialects of Arabic can make mutual conversation impossible (to quote Wikipedia: Arabic has many different, geographically distributed spoken varieties, some of which are mutually unintelligible). That's like confusing Breton and French because they're both Latin and based in France!
Rushyo
To reiterate: Arabic script includes Kurdish, Urdu, Sindhi and Kashmiri, Tajik, Kazakh, etc. - in the same way Latin might include English, French, Breton, Cymraeg, German, etc. One man's perfectly sensible Arabic is another man's gibberish.
Rushyo
In other words, if you just want to detect script (which is all you need to decide whether to use RTL or LTR) the problem is trivial and doesn't require anything nearly so complex as language detection - which needs you to teach the system how to detect Kurdish, Urdu, Sindhi, Kashmiri, etc.
Rushyo
I think the lack of distinction between script + language is making your job a helluva lot more complex than it needs to be - I am trying to be helpful, honest :)
Rushyo
+1  A: 

You can use this function, which I have written for you:

<?php
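/**
 * Returns true if the string starts with an Arabic letter.
 *
 * @param string $string UTF-8 encoded input
 * @return bool
 */
function is_arabic($string)
{
    // The "u" modifier is required so that \p{Arabic} is matched against
    // UTF-8 code points rather than single bytes.
    return (preg_match('/^\p{Arabic}/u', $string) > 0);
}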

But please test it before you use it.

[EDIT 1]

Your question was: "How do I detect if an input string is Arabic?" And I have answered it, so what's wrong?

[EDIT 2]

Read this - http://stackoverflow.com/questions/1441562/detect-language-from-string-in-php

[EDIT 3]

Sorry, I rewrote the function as follows; try it:

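function is_arabic($string)
{
    // \x{0600}-\x{06FF} is the Arabic Unicode block; code-point escapes
    // need the braces and the "u" modifier.
    return (preg_match('/^[\x{0600}-\x{06FF}]/u', $string) > 0);
}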
DimaKrasun
"Is Arabic" != "Contains 'Arabic'" - the question title may be a bit vague, but the question body is more than clear, no?
Piskvor
If a string is Arabic, does it contain Arabic letters or not?
DimaKrasun
Piskvor, DimaKrasun's RegEx ought to indeed detect Arabic characters... not just the string 'Arabic'.
Rushyo
Only reason I proposed my alternative is for speed. RegEx isn't necessarily speedy.
Rushyo
DimaKrasun: I haven't tested your code, but you appear to have given two different variable names where you intended to have one ($string != $subject)
Rushyo
oh, yes, excuse me
DimaKrasun
A: 

I assume you're referring to a Unicode string... in which case, just look for the presence of any character with a code between U+0600–U+06FF (1536–1791) in the string.
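
As a rough sketch of that check: the function name below is hypothetical, and it assumes the input is valid UTF-8 and that PHP 7.4+ with the mbstring extension is available (for mb_str_split and mb_ord).

```php
<?php
// Hypothetical helper, not part of any library.
// Assumes $text is valid UTF-8 and mbstring is loaded (PHP 7.4+).
function contains_arabic_block(string $text): bool
{
    foreach (mb_str_split($text, 1, 'UTF-8') as $char) {
        $codePoint = mb_ord($char, 'UTF-8');
        // U+0600 to U+06FF inclusive (1536 to 1791 in decimal)
        if ($codePoint !== false && $codePoint >= 0x0600 && $codePoint <= 0x06FF) {
            return true;
        }
    }
    return false;
}

var_dump(contains_arabic_block('مرحبا')); // bool(true)
var_dump(contains_arabic_block('hello')); // bool(false)
```

It stops at the first matching character, so for an RTL/LTR decision on mixed text you may prefer to count matches instead.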

Rushyo
Inclusive, for clarity.
Rushyo
The first thing I thought of was a regex with U+0600–U+06FF, but next was to use \p{Arabic}. In regex, I think \p{Arabic} is the same as U+0600–U+06FF, but I haven't tried it.
DimaKrasun
I'm pretty sure it's the same, but this method's far quicker.
Rushyo
+2  A: 

Hmm, I may offer an improved version of DimaKrasun's function:

function is_arabic($string) {
    if($string === 'arabic') {
         return true;
    }
    return false;
}

okay, enough joking!

Pekka's suggestion to use the Google Translate API is a good one! But you are relying on an external service, which is always more complicated, etc.

I think Rushyo's approach is good! It's just not that easy. I wrote the following function for you; it's not tested, but it should work...

<?php
function uniord($u) {
    // Copied from the php.net comments: returns the Unicode code point of a
    // single UTF-8 character (Basic Multilingual Plane only).
    $k = mb_convert_encoding($u, 'UCS-2LE', 'UTF-8');
    $k1 = ord(substr($k, 0, 1));
    $k2 = ord(substr($k, 1, 1));
    return $k2 * 256 + $k1;
}
function is_arabic($str) {
    // Normalize the input to UTF-8. mb_convert_encoding() takes the target
    // encoding first and the source encoding second.
    $encoding = mb_detect_encoding($str);
    if($encoding !== false && $encoding !== 'UTF-8') {
        $str = mb_convert_encoding($str, 'UTF-8', $encoding);
    }

    /*
    $str = str_split($str); <- not multibyte safe: it splits by bytes, not characters, so we cannot use it
    $str = preg_split('//u',$str); <- would probably work fine, but a bug was reported in some PHP versions where it splits by bytes as well
    */
    preg_match_all('/.|\n/u', $str, $matches);
    $chars = $matches[0];
    $arabic_count = 0;
    $latin_count = 0;
    $total_count = 0;
    foreach($chars as $char) {
        // ord($char) cannot be used here: it only reads the first byte
        $pos = uniord($char);
        echo $char ." --> ".$pos.PHP_EOL; // debug output

        if($pos >= 1536 && $pos <= 1791) {
            // U+0600 to U+06FF, the Arabic block
            $arabic_count++;
        } else if($pos > 123 && $pos < 123) {
            // placeholder range that never matches; put real Latin bounds here
            $latin_count++;
        }
        $total_count++;
    }
    if($total_count === 0) {
        // empty input: avoid dividing by zero
        return false;
    }
    if(($arabic_count/$total_count) > 0.6) {
        // more than 60% Arabic characters, so it is probably Arabic
        return true;
    }
    return false;
}
$arabic = is_arabic('عربية إخبارية تعمل على مدار اليوم. يمكنك مشاهدة بث القناة من خلال الموقع'); 
var_dump($arabic);
?>

Final thoughts: as you can see, I added a Latin counter as an example. Its range is just a dummy number, but in this way you could detect character sets (Hebrew, Latin, Arabic, Hindi, Chinese, etc.).

You may also want to eliminate some characters first, maybe @, spaces, line breaks, slashes, etc. The PREG_SPLIT_NO_EMPTY flag for the preg_split function would be useful, but because of the bug I didn't use it here.

You could also keep a counter for every character set and see which one occurs the most.
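
A minimal sketch of that per-script counter, assuming valid UTF-8 input, a PCRE build with Unicode support, and PHP 7.3+ (for array_key_first). The function name and the list of scripts are only illustrative choices, not something from the code above.

```php
<?php
// Hypothetical helper: returns the name of the most frequent script,
// or null if none of the listed scripts occurs at all.
function dominant_script(string $text): ?string
{
    // Illustrative selection; extend with other PCRE script names as needed.
    $scripts = ['Arabic', 'Hebrew', 'Latin', 'Cyrillic', 'Han', 'Devanagari'];
    $counts = [];
    foreach ($scripts as $script) {
        // preg_match_all() returns the number of matches (false on error,
        // which the cast turns into 0).
        $counts[$script] = (int) preg_match_all('/\p{' . $script . '}/u', $text);
    }
    arsort($counts); // highest count first
    $top = array_key_first($counts);
    return $counts[$top] > 0 ? $top : null;
}

var_dump(dominant_script('مرحبا بالعالم hi')); // string(6) "Arabic"
var_dump(dominant_script('1234 !?'));          // NULL
```

Digits, spaces and punctuation belong to none of the listed scripts, so they are ignored without a separate clean-up step. Ties between scripts are not handled in any particular way here.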

Finally, you should consider cutting your string off after 200 characters or so. That should be enough to tell which character set is used.

And you have to do some error handling, for division by zero, empty strings, and so on! Don't forget that, please. Any questions? Comment!

If you want to detect the LANGUAGE of a string, you should split it into words and look those words up in some predefined tables. You don't need a complete dictionary; the most common words are enough and it should work fine. Tokenization/normalization is a must as well! There are libraries for that anyway, and this is not what you asked for :) I just wanted to mention it.
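
To make the word-table idea a bit more concrete, here is a rough sketch. The word lists are tiny made-up samples rather than real frequency tables, the names are hypothetical, and the only normalization is lower-casing and splitting on non-letters.

```php
<?php
// Hypothetical word tables: illustrative samples only.
const COMMON_WORDS = [
    'english' => ['the', 'and', 'is', 'you', 'your', 'am', 'i'],
    'german'  => ['der', 'und', 'ist', 'du', 'dein', 'bin', 'ich'],
];

// Hypothetical helper: returns the language with the most word hits, or null.
function guess_language(string $text): ?string
{
    // Normalize: lower-case, then split on anything that is not a letter.
    $words = preg_split('/[^\p{L}]+/u', mb_strtolower($text, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        return null; // e.g. the input was not valid UTF-8
    }
    $hits = [];
    foreach (COMMON_WORDS as $language => $list) {
        // Counts every word of the input that appears in this language's table.
        $hits[$language] = count(array_intersect($words, $list));
    }
    arsort($hits); // best hit count first
    $best = array_key_first($hits);
    return $hits[$best] > 0 ? $best : null;
}

var_dump(guess_language('I am your grandfather')); // string(7) "english"
var_dump(guess_language('Ich bin der Computer'));  // string(6) "german"
```

A real implementation would need much larger tables and some rule for ties, since one word can belong to several languages.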

Joe Hopfgartner
Your function is making my head go all fuzzy. I'll try to implement it when i'm in a better mood, and let you know if it worked on it. But from what I read, it looks promising.
gAMBOOKa
roger that, don't forget to include the external uniord function on the top! lemme know if ya need any halp
Joe Hopfgartner
The dictionary is a very good idea, only problem is that outside Latin script you quickly encounter circumstances where external context changes characters - such as multi-glyph characters. You'd have to be careful to avoid context-sensitive characters in your dictionary.
Rushyo
@Rushyo ... what? ... You split the text into words by whitespace, tokenize it, lower-case it and see what you hit in your database. If you get a hit, see what relations there are. One word can belong to more than one language. From the hit ratio you should pretty easily be able to tell. Example: "i am your grandfathers computer xcT4" -> tokenized "i am your grandfather computer xcT4". Assume i, am, your, grandfather are English words and computer is English as well as German. xcT4 is unknown. You will get 4 vs. 1, a good ratio to guess it's English.
Joe Hopfgartner
The original question was only about detecting the character set. The provided solution works very well with multi-byte characters. Language detection is a whole different thing where multi-byte characters don't really matter...
Joe Hopfgartner
A multi-byte character can be made up of multiple glyphs. It is similar to the problem a != á, except that outside of Latin you have situations where characters alter based on the context they are used in. So you have 'abc', yet when you type 'd' your word suddenly changes to 'abXd' or similar. This occurs regularly in Arabic script (for example, with the prefix al-) so simply searching for 'al-' will bring up zip. It's just a gotcha worth looking out for.
Rushyo
To reiterate: A code point is not a character is not a glyph. It makes a seemingly simple problem non-trivial. al- !== al-
Rushyo
See: http://en.wikipedia.org/wiki/Ligature_%28typography%29
Rushyo
Quote: "The Arabic alphabet, historically a cursive derived from the Nabataean alphabet, most letters take a variant shape depending on which they are followed (word-initial), preceded (word-final) or both (medial) by other letters."
Rushyo
Not to mention digraphs.. although since we're not talking Croatian we're safe from those =]
Rushyo
To put it in a Latin context: Encyclopædia !== Encyclopaedia... yet we'd want those to be the same. In Latin these edge cases are so rare as to be no issue in 99.9999% of circumstances, in Arabic it's a much bigger problem. That said, Unicode neatly side-steps some of the problems by dumping the problem on the renderer.
Rushyo