Is there an existing Java library that could tell me whether a String contains English language text or not (e.g. I need to be able to distinguish French or Italian text -- the function needs to return false for French and Italian, and true for English)?
There's no "good" way of doing this, imo; any answer on this topic can get very complicated. The obvious first step is to check for characters that occur in French or Italian but not in English, and return false if you find any.
However, what if a word is French but has no special characters? Suppose you have a whole sentence. You could look up each word in a dictionary for each language, and if the sentence scores more French points than English points, it's not English. Scoring this way also copes with the common words that French, Italian and English share, since a shared word adds a point to every language it belongs to.
Good Luck.
You could try comparing each word to an English, French, and Italian dictionary. Keep in mind, though, that some words may appear in more than one dictionary.
Here's an interesting blog post that discusses this concept. The examples are in Scala, but you should be able to apply the same general concepts to Java.
If you are looking at individual characters or words, this is a tough problem. Since you're working with a whole document, however, there might be some hope. Unfortunately, I don't know of an existing library to do this.
In general, one would need a fairly comprehensive word list for each language. Then examine each word in the document. If it appears in the dictionary for a language, give that language a "vote". Some words will appear in more than one language, and sometimes a document in one language will use loanwords from another language, but a document wouldn't have to be very long before you saw a very clear trend toward one language.
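The voting idea can be sketched roughly as below. This is only an illustration: the class name `WordVoteDetector` and its methods are made up for this example, and the three tiny word sets (borrowed from the frequent-word figures quoted elsewhere in this thread) stand in for the comprehensive word lists you would really need. It assumes Java 9+ for `Set.of`.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class WordVoteDetector {
    // Placeholder word lists; a real detector would load full dictionaries.
    static final Map<String, Set<String>> WORD_LISTS = new LinkedHashMap<>();
    static {
        WORD_LISTS.put("en", Set.of("the", "and", "to", "of", "a", "in", "was", "his", "that", "he"));
        WORD_LISTS.put("fr", Set.of("de", "la", "le", "\u00e0", "et", "des", "les", "du", "en", "un"));
        WORD_LISTS.put("it", Set.of("di", "e", "il", "che", "la", "a", "in", "per", "del", "un"));
    }

    // Gives each language one vote for every word of the text found in its list.
    static Map<String, Integer> votes(String text) {
        Map<String, Integer> tally = new LinkedHashMap<>();
        for (String lang : WORD_LISTS.keySet()) {
            tally.put(lang, 0);
        }
        // Split on anything that is not a letter, so punctuation is ignored.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (word.isEmpty()) {
                continue;
            }
            for (Map.Entry<String, Set<String>> e : WORD_LISTS.entrySet()) {
                if (e.getValue().contains(word)) {
                    tally.merge(e.getKey(), 1, Integer::sum);
                }
            }
        }
        return tally;
    }

    // True only if English has at least one vote and strictly more than any other language.
    static boolean isEnglish(String text) {
        Map<String, Integer> tally = votes(text);
        int english = tally.get("en");
        if (english == 0) {
            return false;
        }
        for (Map.Entry<String, Integer> e : tally.entrySet()) {
            if (!e.getKey().equals("en") && e.getValue() >= english) {
                return false;
            }
        }
        return true;
    }
}
```

Words such as "in" or "la" vote for two languages at once here, which is why the decision is made on the overall tally rather than on any single word.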
Some of the best word lists for English are those used by Scrabble players. These lists probably exist for other languages too. The raw lists can be hard to find via Google, but they are out there.
There are various techniques, and a robust method would combine several of them:
- look at the frequencies of groups of n letters (say, groups of 3 letters or trigrams) in your text and see if they are similar to the frequencies found for the language you are testing against
- look at whether the frequencies of common words in your text match the frequencies found for the given language (this tends to work better for longer texts)
- does the text contain characters which strongly narrow it down to a particular language? (e.g. if the text contains an upside down question mark there's a good chance it's Spanish)
- can you "loosely parse" certain features in the text that would indicate a particular language? e.g. if it contains a match to the following regular expression, you could take this as a strong clue that the language is French:
\bvous\s+\p{L}+ez\b
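As a rough sketch of how the last two clues might look in Java: the regular expression is the one above (with backslashes doubled for a Java string literal), and the character check looks for the inverted question and exclamation marks. The class and method names are invented for this example, and case-insensitive matching is my own addition.

```java
import java.util.regex.Pattern;

final class LanguageClues {
    // "vous" followed by a word ending in "ez", e.g. "vous parlez".
    static final Pattern FRENCH_VOUS_EZ = Pattern.compile(
            "\\bvous\\s+\\p{L}+ez\\b",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    // A strong hint of French, not a proof.
    static boolean hasFrenchVousClue(String text) {
        return FRENCH_VOUS_EZ.matcher(text).find();
    }

    // Inverted question mark (U+00BF) or exclamation mark (U+00A1): a hint of Spanish.
    static boolean hasInvertedPunctuation(String text) {
        return text.indexOf('\u00bf') >= 0 || text.indexOf('\u00a1') >= 0;
    }
}
```

Neither check can confirm that a text is English; each can only push the decision away from it, so they are best used alongside the frequency-based scores.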
To get you started, here are frequent trigram and word counts for English, French and Italian (copied and pasted from some code-- I'll leave it as an exercise to parse them):
Locale.ENGLISH,
"he_=38426;the=38122;nd_=20901;ed_=20519;and=18417;ing=16248;to_=15295;ng_=15281;er_=15192;at_=14219",
"the=11209;and=6631;to=5763;of=5561;a=5487;in=3421;was=3214;his=2313;that=2311;he=2115",
Locale.FRENCH,
"es_=38676;de_=28820;ent=21451;nt_=21072;e_d=18764;le_=17051;ion=15803;s_d=15491;e_l=14888;la_=14260",
"de=10726;la=5581;le=3954;" + ((char)224) + "=3930;et=3563;des=3295;les=3277;du=2667;en=2505;un=1588",
Locale.ITALIAN,
"re_=7275;la_=7251;to_=7208;_di=7170;_e_=7031;_co=5919;che=5876;he_=5622;no_=5546;di_=5460",
"di=7014;e=4045;il=3313;che=3006;la=2943;a=2541;in=2434;per=2165;del=2013;un=1945",
(Trigram counts are per million characters; word counts are per million words. The '_' character represents a word boundary.)
As I recall, the figures are cited in the Oxford Handbook of Computational Linguistics and are based on a sample of newspaper articles. If you have a corpus of text in these languages, it's easy enough to derive similar figures yourself.
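For what it's worth, parsing one of those `key=count;key=count` strings could look something like this. `FrequencyTableParser` is just a name chosen for this sketch; it assumes entries are separated by `;` and that the count follows the last `=` in each entry.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class FrequencyTableParser {
    // Turns "he_=38426;the=38122" into an ordered map {he_=38426, the=38122}.
    static Map<String, Integer> parse(String table) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String entry : table.split(";")) {
            int eq = entry.lastIndexOf('=');
            if (eq <= 0) {
                continue; // skip empty or malformed entries
            }
            String key = entry.substring(0, eq);
            int count = Integer.parseInt(entry.substring(eq + 1).trim());
            counts.put(key, count);
        }
        return counts;
    }
}
```

A `LinkedHashMap` keeps the entries in their original order, which matters if you later want to weight a trigram by its position in the list.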
If you want a really quick-and-dirty way of applying the above, try:
- consider each sequence of three characters in your text (replacing word boundaries with '_')
- for each trigram that matches one of the frequent ones for the given language, increment that language's "score" by 1 (for a more sophisticated version, you could weight by position in the list)
- at the end, assume the language is that with the highest score
- optionally, do the same for the common words (combine scores)
Obviously, this can then be refined, but you might find that this simple solution is good enough for what you want, since you're essentially interested in "English or not".
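Here is a minimal sketch of the quick-and-dirty trigram version, using only the ten trigrams per language listed above and a flat score of 1 per match. The class name, the `"unknown"` result and the tie-breaking rule (first language in insertion order wins) are choices made for this example, not part of any library. It assumes Java 9+ for `Set.of`.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class TrigramLanguageGuesser {
    // The ten most frequent trigrams per language, from the figures above.
    static final Map<String, Set<String>> TOP_TRIGRAMS = new LinkedHashMap<>();
    static {
        TOP_TRIGRAMS.put("en", Set.of("he_", "the", "nd_", "ed_", "and", "ing", "to_", "ng_", "er_", "at_"));
        TOP_TRIGRAMS.put("fr", Set.of("es_", "de_", "ent", "nt_", "e_d", "le_", "ion", "s_d", "e_l", "la_"));
        TOP_TRIGRAMS.put("it", Set.of("re_", "la_", "to_", "_di", "_e_", "_co", "che", "he_", "no_", "di_"));
    }

    // Lower-cases the text and collapses every run of non-letters into a single '_'.
    static String normalize(String text) {
        return ("_" + text.toLowerCase(Locale.ROOT) + "_").replaceAll("[^\\p{L}]+", "_");
    }

    // Adds 1 to a language's score for each trigram of the text that is in its list.
    static Map<String, Integer> scores(String text) {
        String s = normalize(text);
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String lang : TOP_TRIGRAMS.keySet()) {
            result.put(lang, 0);
        }
        for (int i = 0; i + 3 <= s.length(); i++) {
            String trigram = s.substring(i, i + 3);
            for (Map.Entry<String, Set<String>> e : TOP_TRIGRAMS.entrySet()) {
                if (e.getValue().contains(trigram)) {
                    result.merge(e.getKey(), 1, Integer::sum);
                }
            }
        }
        return result;
    }

    // Returns the highest-scoring language code, or "unknown" if nothing matched.
    static String guess(String text) {
        String best = "unknown";
        int bestScore = 0;
        for (Map.Entry<String, Integer> e : scores(text).entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }

    static boolean isEnglish(String text) {
        return "en".equals(guess(text));
    }
}
```

With only ten trigrams per language, short inputs will often score zero or tie, so treat the result as reliable only for a sentence or more of text.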