views:

596

answers:

10

I am building a project involving natural language processing. Since the NLP module currently only deals with English text, I have to make sure that user-submitted content (not long, only several words) is in English. Are there established ways to achieve this? A Python or JavaScript solution is preferred.

A: 

You could break the phrase up into words and check them against a dictionary (there are some that you can download; this may be of interest), but that requires the dictionary you use to be good enough.

It would also fall over for proper nouns (my name isn't in the dictionary for example).

SCdF
+7  A: 

If the content is long enough, I would suggest some frequency analysis on the letters.
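A minimal sketch of what letter-frequency analysis could look like, not a tested detector. The frequency table holds commonly quoted ballpark values for English and is an assumption, as are the function name `letter_similarity` and the choice of cosine similarity as the score.

```python
from collections import Counter
import math

# Approximate relative frequencies (percent) of letters in English text.
# Ballpark values, used here as an assumption.
ENGLISH_FREQ = {
    "e": 12.7, "t": 9.1, "a": 8.2, "o": 7.5, "i": 7.0, "n": 6.7,
    "s": 6.3, "h": 6.1, "r": 6.0, "d": 4.3, "l": 4.0, "c": 2.8,
    "u": 2.8, "m": 2.4, "w": 2.4, "f": 2.2, "g": 2.0, "y": 2.0,
    "p": 1.9, "b": 1.5, "v": 1.0, "k": 0.8, "j": 0.15, "x": 0.15,
    "q": 0.1, "z": 0.07,
}

def letter_similarity(text):
    """Cosine similarity between the text's letter counts and ENGLISH_FREQ."""
    counts = Counter(ch for ch in text.lower() if ch in ENGLISH_FREQ)
    if not counts:
        return 0.0  # no ASCII letters at all
    dot = sum(counts[ch] * freq for ch, freq in ENGLISH_FREQ.items())
    norm_text = math.sqrt(sum(c * c for c in counts.values()))
    norm_ref = math.sqrt(sum(f * f for f in ENGLISH_FREQ.values()))
    return dot / (norm_text * norm_ref)
```

A score near 1 means the letter distribution resembles English. Where to put the cut-off would have to be tuned on real data, and other languages written in the Latin alphabet can score high as well, which is why this only helps on longer text.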

But for a few words I think your best bet is to compare them to an English dictionary and accept the input if half of them match.

Pat
Your second idea would rule out pretty much every comment on YouTube.
Tyson
@Tyson, Great, another advantage I hadn't thought about ;-)
Pat
+1  A: 

Try http://wordlist.sourceforge.net/ for a list of English words.

You will need to be careful of names, e.g. "Canberra" or "Bill Clinton". These won't appear in the word list. I suggest just checking whether the first letter is capitalized as a first attempt.
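A rough sketch of that first attempt: set aside capitalized tokens as possible names before checking the rest against the word list. The function name `split_probable_names` and the punctuation stripped are my own assumptions.

```python
def split_probable_names(text):
    """Separate capitalized tokens (possible proper nouns) from the rest."""
    names, others = [], []
    for token in text.split():
        word = token.strip(".,;:!?\"'()")
        if not word:
            continue
        # A leading capital is only a hint: the first word of a sentence
        # is capitalized too, so this over-counts names.
        if word[0].isupper():
            names.append(word)
        else:
            others.append(word)
    return names, others
```

Only the second list would then be looked up in the word list.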

Owen
+5  A: 

I think the most effective way would be to ask the users to submit English text only :)

You can show a language selection drop-down over your text area with English/Other as the options. When the user selects "Other", disable the text area and show a message that only English is supported [at the moment].

Tahir Akhtar
But you have to validate that, otherwise the NLP module will have problems.
btw0
Yes, you are right. But in such applications it is often good to stress the "garbage in, garbage out" rule so there are fewer user errors.
Tahir Akhtar
+6  A: 

Check the Language Recognition Chart

AquilaX
Is there a known algorithm for using this chart? I mean, would you calculate a score for each language and then sort the results? Or can we use a threshold value for each language?
Tahir Akhtar
+4  A: 

Try n-gram based statistical language recognition. This is a link to a demo of an algorithm using this technique; there is also a link to a paper describing the algorithm there. Try the demo: it performs quite well even on very short texts (3-4 words).
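To give an idea of the technique, here is a small sketch of one common variant: rank the character trigrams of each language by frequency, then compare the rank order of the input against each language ("out-of-place" distance). This is not necessarily the algorithm behind the linked demo. The function names, the `MISSING_PENALTY` value, and the toy training sentences are all assumptions made for illustration.

```python
from collections import Counter

MISSING_PENALTY = 300  # assumed cost for an n-gram absent from a language profile

def ngram_profile(text, n=3, top=300):
    """Return the most frequent character n-grams, most common first."""
    padded = " " + " ".join(text.lower().split()) + " "
    grams = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    return [gram for gram, _ in grams.most_common(top)]

def out_of_place(sample_profile, language_profile):
    """Sum of rank differences; unseen n-grams cost MISSING_PENALTY each."""
    ranks = {gram: rank for rank, gram in enumerate(language_profile)}
    return sum(
        abs(ranks[gram] - rank) if gram in ranks else MISSING_PENALTY
        for rank, gram in enumerate(sample_profile)
    )

def guess_language(text, profiles):
    """Pick the language whose profile is closest to the text's profile."""
    sample = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(sample, profiles[lang]))

# Toy training text, far too small for real use; real profiles need
# much larger corpora per language.
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "fr": "le renard brun saute par dessus le chien paresseux et le chat",
}
PROFILES = {lang: ngram_profile(text) for lang, text in TRAINING.items()}

print(guess_language("the lazy dog", PROFILES))
```

For the question's use case you would accept the input only when the best match is English, and probably also require its distance to be clearly smaller than the runner-up's.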

Rafał Dowgird
+3  A: 

You are already doing NLP. If your module doesn't understand what language the text is in, then either the module doesn't work or the input was not in the correct language.

John Ferguson
A: 

The Dictionary Switcher Firefox extension has an option to detect the right dictionary as I type.
I guess it checks words against the installed dictionaries and selects the one giving the fewest errors...

You can't expect all words of the text to be in the dictionary: abbreviations, proper nouns, typos... Besides, some words are common to several languages: a French rock group even gave their records titles that have a (different) meaning in both French and English. So it is a statistical thing: if more than x% of the words are found in a good English dictionary, chances are the user is typing in this language (even if there are mistakes, as there probably are in this answer, since I am not a native English speaker).
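A minimal sketch of that x% check, assuming the dictionary has already been loaded into a Python set. The names `english_ratio` and `looks_english`, the default threshold of 0.5, and the tiny `SAMPLE_WORDS` set are placeholders for illustration only.

```python
import re

def english_ratio(text, dictionary):
    """Fraction of the words in `text` that appear in `dictionary`."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(word in dictionary for word in words) / len(words)

def looks_english(text, dictionary, threshold=0.5):
    """Accept the text when at least `threshold` of its words are known."""
    return english_ratio(text, dictionary) >= threshold

# Tiny stand-in word set; a real check would load a full English
# word list from a file into a set.
SAMPLE_WORDS = {"the", "is", "a", "this", "cat", "on", "mat", "sat"}

print(looks_english("The cat sat on the mat", SAMPLE_WORDS))
print(looks_english("Le chat est sur le tapis", SAMPLE_WORDS))
```

The threshold is the x% knob: lower it to tolerate more typos and names, raise it to reject more mixed-language input.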

PhiLho
+4  A: 

Google has a JavaScript API that includes an implementation of language detection. I've only played around with it and never used it in production.

http://code.google.com/apis/ajaxlanguage/documentation/#Detect

Prairiedogg
A: 

Maybe the article "Ensuring that the user submits only English text [PHP]" will help you. The code is written in PHP, but it is small enough to be easily rewritten.

valums