views:

234

answers:

5

Is there a way (a program, a library) to approximately determine which language a document is written in?

I have a bunch of text documents (~500K) in mixed languages to import into an i18n-enabled CMS (Drupal).

I don't need perfect matches, just a rough guess.

A: 

There seems to be a Perl module for this:

http://search.cpan.org/~ambs/Lingua-Identify-0.19/lib/Lingua/Identify.pm
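
A minimal usage sketch (untested): the module exports a langof() function that returns a language code in scalar context; check the documentation above for the exact interface. The file handling below is just illustrative.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Lingua::Identify qw(langof);

    # Slurp a whole file and guess its language.
    my $file = shift @ARGV or die "usage: $0 <file>\n";
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my $text = do { local $/; <$fh> };
    close $fh;

    # In scalar context, langof() returns the code of the most likely
    # language, e.g. 'en', 'fr', 'pt'.
    my $lang = langof($text);
    print "$file looks like: ", (defined $lang ? $lang : 'unknown'), "\n";

For ~500K documents you would just loop something like this over the files before handing them to Drupal.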

Paul
A: 

I'd say your best bet is to look for key words - articles, that kind of thing - that are unique to the languages you're looking for. "Un" will show up in both Spanish and French, for example, but "une" is identifiably French whereas "unos", for example, is identifiably Spanish. Diacritics are useful too - you'll see "ñ" in Spanish and possibly Portuguese, "ç" in French and a few others... that kind of thing.
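
To make that concrete, here is a rough, untested sketch of that kind of keyword-and-diacritic scoring. The marker lists are tiny made-up examples; real lists would need far more entries per language.

    use strict;
    use warnings;
    use utf8;

    # Tiny example marker lists (placeholders, not real linguistic data).
    my %keywords = (
        fr => [qw(une le la les et est dans)],
        es => [qw(unos una el los y es)],
        en => [qw(the and is of in)],
    );
    my %diacritics = (
        fr => [qw(ç é è à)],
        es => [qw(ñ ¿ ¡)],
        en => [],
    );

    sub guess_language {
        my ($text) = @_;
        my %score;
        for my $lang (keys %keywords) {
            $score{$lang} = 0;
            # whole-word hits on language-specific function words
            for my $word (@{ $keywords{$lang} }) {
                my $hits = () = $text =~ /\b\Q$word\E\b/gi;
                $score{$lang} += $hits;
            }
            # raw character hits on language-specific diacritics
            for my $char (@{ $diacritics{$lang} }) {
                my $hits = () = $text =~ /\Q$char\E/g;
                $score{$lang} += $hits;
            }
        }
        my ($best) = sort { $score{$b} <=> $score{$a} } keys %score;
        return $best;
    }

    print guess_language('une phrase est dans le texte'), "\n";   # prints 'fr'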

edit - Paul's solution is probably the best; it looks like it uses methods like the ones I outlined, plus a few extras.

Noah Witherspoon
A: 

By running a Google search for "determine language of document" I found many different sites that will help you. The third link on the first page eventually led me to a function in the Google Code API that is exactly what you need.

Stephen
+5  A: 

There is a pretty easy way to do this, given that you have corpus data in all the different languages you'll need to identify. It's called n-gram modeling. I think Lingua::Identify does this already, though, so that is your best bet rather than implementing your own.
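
The idea, roughly: build a character n-gram (say, trigram) frequency profile from a sample of each candidate language, build the same profile for the document, and pick the language whose profile matches best. A toy, untested sketch (the "corpora" below are placeholders; real profiles need at least a few kilobytes of text per language):

    use strict;
    use warnings;

    # Build a normalized character-trigram frequency profile for a piece of text.
    sub profile {
        my ($text) = @_;
        $text = lc $text;
        $text =~ s/[^\p{L}]+/ /g;     # keep letters, collapse everything else to spaces
        my (%freq, $total);
        for my $i (0 .. length($text) - 3) {
            $freq{ substr($text, $i, 3) }++;
            $total++;
        }
        return {} unless $total;
        $freq{$_} /= $total for keys %freq;
        return \%freq;
    }

    # Dot product of two profiles: a crude similarity score, enough for a rough guess.
    sub similarity {
        my ($p, $q) = @_;
        my $sum = 0;
        $sum += $p->{$_} * ($q->{$_} || 0) for keys %$p;
        return $sum;
    }

    # Placeholder training text per language.
    my %corpus = (
        en => 'the quick brown fox jumps over the lazy dog and then the dog sleeps',
        fr => 'le renard brun saute par dessus le chien paresseux et puis le chien dort',
        es => 'el rapido zorro marron salta sobre el perro perezoso y luego el perro duerme',
    );
    my %model = map { $_ => profile($corpus{$_}) } keys %corpus;

    sub guess_language {
        my ($text) = @_;
        my $doc = profile($text);
        my ($best) = sort { similarity($doc, $model{$b}) <=> similarity($doc, $model{$a}) }
                     keys %model;
        return $best;
    }

    print guess_language('un texte assez court mais clairement en francais'), "\n";   # 'fr'

Lingua::Identify appears to package this same general approach with proper per-language data, so for 500K documents the module is still the sensible choice over rolling your own.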

Claudiu
A: 

The Google Translation API is cool and has a REST interface, but I would need to send it a LOT of BIG documents (yes, I could use an excerpt), and even though Google is Google, I don't think that's fair.

The documents are also not mine, and I'd have to ask my client whether it's OK to send them to a third party (even if, sooner or later, Google will get them anyway ;)).

I think I'll go down the Perl path...

Claudio Cicali