
Hi folks,

I have a list of articles, and each article has its own title and description.

Unfortunately, from the sources I am using, there is no way to know what language they are written in.

Also, the text is not entirely written in one language; English words are almost always mixed in.


I reckon I would need dictionary databases stored on my machine, but that feels a bit impractical.

What would you suggest I do?

+2  A: 

Have you looked into http://ling.unizd.hr/~dcavar/LID/ and http://en.wikipedia.org/wiki/Language_identification ?

neo
+4  A: 

You could try the Google AJAX Language API if you don't mind using a web service to do your work for you.
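
For reference, a detection request looked roughly like this. This is a sketch based on how the API was documented at the time; the service has since been deprecated, so the endpoint and response layout here are assumptions about its historical form:

```python
# Sketch of a call to the Google AJAX Language API detect endpoint.
# NOTE: the service has been deprecated; the URL and response layout
# reflect how the API was documented at the time.
import json
import urllib.parse
import urllib.request

def detect_language(text):
    params = urllib.parse.urlencode({"v": "1.0", "q": text})
    url = "http://ajax.googleapis.com/ajax/services/language/detect?" + params
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    result = data["responseData"]
    return result["language"], result["confidence"]

# Mixed-language input still gets a single best guess:
print(detect_language("Guten Tag everyone, wie geht es euch?"))
```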

Kristo
+4  A: 

I'd use the guess-language project.
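
Basic usage is a one-liner. A sketch, assuming the `guessLanguage` entry point from the project's README (the name may differ between forks):

```python
# Sketch of guess-language usage; the guessLanguage function name is
# taken from the project's README and may differ between forks.
from guess_language import guessLanguage

print(guessLanguage("Ceci est un petit texte écrit en français."))  # -> 'fr'
# Mixed text gets whichever language dominates:
print(guessLanguage("This is mostly English avec quelques mots français."))
```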

Alex Martelli
@Alex: thanks Alex. This is indeed very useful!
RadiantHex
+1 ...very handy!
Andy
+1  A: 

If neo's recommendation is also impractical, I would try something like this:

Many languages have a handful of keywords that appear in most sentences and are rarely found in other languages.

Example: "The" in English, "der", "die", "das" in German, ....

Find such words and look for them in your texts. The results can be a little fuzzy -- for example, if you find both "the" and "der", it could be a German text containing some English sentences. But with enough keywords from each of your target languages, you should reach a high hit rate. A minimal sketch is below.
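
Here is a minimal sketch of that idea. The stop-word lists are tiny illustrative samples, not complete:

```python
# Count hits against a small stop-word list per language and pick the
# best score. The word lists here are illustrative samples only.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "de": {"der", "die", "das", "und", "ist", "nicht"},
    "fr": {"le", "la", "les", "et", "est", "ne"},
}

def guess_by_stopwords(text):
    words = text.lower().split()
    scores = {
        lang: sum(1 for w in words if w in stops)
        for lang, stops in STOPWORDS.items()
    }
    return max(scores, key=scores.get), scores

print(guess_by_stopwords("Der Hund ist nicht in the house"))
# Mixed input: the German keywords outnumber the English ones, so 'de' wins.
```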

Juergen
That is what `guess-language` does.
voyager
@voyager: Thanks for the info. I guessed so ;-) No, I did not know about guess-language or the other tools before. But I think all these tools still can't do magic.
Juergen
@Juergen: neither did I, but I looked around a bit in the source and that is what it was doing. :)
voyager
+1  A: 

In general you're looking at doing n-gram identification. Since this is a Python question, you might take a look at http://github.com/koblas/ngramj-python which is a pure Python port of the Java NGramJ library (another open source project).

The documentation is lacking, but it has really good accuracy.
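
To give a feel for how character n-gram identification works, here is a toy sketch (this is not ngramj-python's API; the training snippets are placeholders, and real tools build their profiles from large corpora):

```python
# Toy character-trigram identification: build a trigram frequency
# profile per language, then score unknown text by profile overlap.
from collections import Counter

def trigrams(text):
    # Pad with spaces so word boundaries contribute n-grams too.
    text = " " + text.lower() + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

# Placeholder training text -- real tools use large corpora per language.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then some",
    "de": "der schnelle braune fuchs springt über den faulen hund",
}
PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}

def identify(text):
    grams = trigrams(text)
    # Score by counting trigram overlap with each language profile.
    scores = {
        lang: sum(min(count, profile[gram]) for gram, count in grams.items())
        for lang, profile in PROFILES.items()
    }
    return max(scores, key=scores.get)

print(identify("ein brauner hund"))  # overlaps the German profile more -> 'de'
```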

koblas