I'm looking for a way to automatically determine the natural language used by a website page, given its URL.

In Python, a function like:

def LanguageUsed(url):
    # stuff

which returns a language specifier (e.g. 'en' for English, 'ja' for Japanese, etc.)

Summary of Results: I have a reasonable solution working in Python using code from the oice.langdet package on PyPI. It does a decent job of discriminating English vs. non-English, which is all I need at the moment. Note that you have to fetch the HTML yourself using Python's urllib. Also, oice.langdet is GPL-licensed.

For a more general solution using trigrams in Python, as others have suggested, see this Python Cookbook recipe from ActiveState.

The Google Natural Language Detection API works very well (if not the best I've seen). However, it is JavaScript, and their TOS forbids automating its use.

+1  A: 

nltk might help (if you have to get down to dealing with the page's text, i.e. if the headers and the URL itself don't determine the language sufficiently well for your purposes). I don't think NLTK directly offers a "tell me which language this text is in" function (though NLTK is large and continuously growing, so it may in fact have one), but you can try parsing the given text according to various possible natural languages and checking which ones give the most sensible parse, word set, etc., according to the rules for each language.
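
For instance, a rough sketch of that idea using NLTK's stopwords corpus (this assumes the stopwords data has already been downloaded via nltk.download('stopwords'); the function name is just illustrative):

from nltk import wordpunct_tokenize
from nltk.corpus import stopwords

def guess_language_nltk(text):
    # Score each candidate language by how many of its stopwords occur
    # in the text; the language with the most hits is the best guess.
    words = set(w.lower() for w in wordpunct_tokenize(text))
    scores = dict((lang, len(words & set(stopwords.words(lang))))
                  for lang in stopwords.fileids())
    return max(scores, key=scores.get)

Note that the stopwords corpus uses full language names ('english', 'spanish', ...) rather than ISO codes, so a small mapping is needed if you want 'en'/'es' back.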

Alex Martelli
A: 

There's no general method that will work solely on URLs. You can check the top-level domain to get some idea, and look for portions of the URL that might be indicative of a language (like "en" or "es" between two slashes), and assume anything unknown is in English, but it isn't a perfect solution.

So far as I know, the only general way to determine the natural language used by a page is to grab the page's text and check for certain common words in each language. For example, if "a", "an", and "the" appear several times in the page, it's likely that it includes English text; "el" and "la" might suggest Spanish; and so on.
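
A bare-bones, dependency-free version of that heuristic might look like the sketch below (the word lists are only illustrative and should be much longer in practice):

import re

# Tiny illustrative common-word sets; extend these for real use.
COMMON_WORDS = {
    'en': set(['the', 'a', 'an', 'and', 'of', 'is', 'to', 'in']),
    'es': set(['el', 'la', 'los', 'las', 'de', 'que', 'y', 'es']),
    'fr': set(['le', 'la', 'les', 'de', 'des', 'et', 'est', 'un']),
}

def guess_language_simple(text):
    # Count hits from each language's common-word set and return the
    # language code with the most matches.
    words = re.findall(r"[a-z']+", text.lower())
    counts = dict((lang, sum(1 for w in words if w in vocab))
                  for lang, vocab in COMMON_WORDS.items())
    return max(counts, key=counts.get)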

Head Geek
A: 

Here is a function that will do that:

def LanguageUsed(url):
    # Parse the URL and return the last label of the hostname, i.e. the
    # top-level domain ('http://example.fr/page' -> 'fr').
    from urlparse import urlparse  # urllib.parse in Python 3
    o = urlparse(url)
    return o.hostname.split('.')[-1]

I recommend looking at the documentation found here if you want to do more URL processing.

Peter
Your function returns None for the hostname when "http://" or "https://" (or another variation) isn't in the URL, and then it crashes because you can't split a NoneType.
mouche
Again, at best you can come up with the country the domain is registered in, which doesn't mean much. Lots of international or even multilingual sites have .com domains. The bottom line is that you can't tell from the URL alone.
Markus
+6  A: 

Your best bet really is to use Google's natural language detection API. It returns an ISO code for the page's language, along with a probability index.

See http://code.google.com/apis/ajaxlanguage/documentation/

Vincent Buck
+3  A: 

There is nothing about the URL itself that will indicate language.

One option would be to use a natural language toolkit to try to identify the language based on the content, but even if you get the NLP part working, it'll be pretty slow, and it may not be reliable. Remember, most user agents send something like

Accept-Language: en-US

with each request, and many large websites will serve different content based on that header. Smaller sites are more reliable in this respect because they tend not to pay attention to the language headers.
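
If you do fetch the page yourself for content-based detection, it can help to pin the Accept-Language header so content negotiation doesn't hand you a version localized to your own locale. A minimal sketch with urllib2 (swap in whatever HTTP library you actually use):

import urllib2

def fetch_page(url):
    # Send a neutral Accept-Language so the server is less likely to
    # serve content negotiated to our locale.
    req = urllib2.Request(url, headers={'Accept-Language': '*'})
    return urllib2.urlopen(req).read()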

You could also use server location (i.e. which country the server is in) as a proxy for language using GeoIP. It's obviously not perfect, but it is much better than using the TLD.

tghw
Geolocation is utterly useless. The world has plenty of places where multiple languages coexist, and websites may also feature multiple languages.
Vincent Buck
All I said was that it's better than TLD, which some people are suggesting, and I addressed the multiple languages issue.
tghw
+4  A: 

This is usually accomplished with character n-gram models. You can find here a state-of-the-art language identifier for Java. If you need some help converting it to Python, just ask. Hope it helps.
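
For reference, the core of the character n-gram approach (the Cavnar & Trenkle "out-of-place" measure) is short enough to sketch in Python; you would build one reference profile per language from a reasonable amount of sample text:

from collections import Counter

def trigram_profile(text, size=300):
    # Rank the most frequent character trigrams in the text.
    text = ' '.join(text.lower().split())
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return [g for g, _ in grams.most_common(size)]

def out_of_place(profile, reference):
    # Sum of rank differences; trigrams missing from the reference
    # profile get the maximum penalty.
    ranks = dict((g, i) for i, g in enumerate(reference))
    penalty = len(reference)
    return sum(abs(i - ranks.get(g, penalty)) for i, g in enumerate(profile))

def guess_language_ngrams(text, references):
    # references: dict mapping language code -> precomputed profile
    profile = trigram_profile(text)
    return min(references, key=lambda lang: out_of_place(profile, references[lang]))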

JG
A: 

Thanks for the replies so far. Just to clarify a bit, I'm still fairly new to web dev, though not to software. I looked into inspecting HTML meta tags, the xml:lang attribute, and character-set detection. Originally I thought this should be a simple problem, since it only makes sense for each web page to follow some convention and explicitly declare the language it uses.
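
For what it's worth, checking for an explicit declaration only takes a few lines; here is a rough sketch (in practice a lot of pages declare nothing, or declare the wrong thing):

import re
import urllib2

def declared_language(url):
    # Look for explicit declarations: the Content-Language response
    # header, then a lang / xml:lang attribute on the <html> tag.
    # Returns None if the page doesn't declare a language.
    response = urllib2.urlopen(url)
    header = response.info().getheader('Content-Language')
    if header:
        return header.split(',')[0].strip()
    head = response.read(50000)  # the <html> tag appears near the top
    match = re.search(r'<html[^>]*\s(?:xml:)?lang=["\']?([A-Za-z-]+)', head)
    return match.group(1) if match else None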

After all, Google only returns English results for me, so it must have some way of extracting the natural language while it crawls. Some of those results come from the US, UK, Canada, and other English-speaking countries. But Canada also has French, so how can it tell?

Looking for common words is probably the closest thing to a general answer so far. Still, I can't help wondering whether there isn't a simpler solution to what seems like a common problem. Or is the web still a wild west?

Also, I'm OK with calling other libraries that use the URL to get more information, such as GeoIP or the HTML source.

I looked at NLTK for another problem; I'll have to take another look. My only real concern then is performance.

Geo also seems fairly reliable, except for edge cases where someone posts, say, a Spanish web page on a non-Spanish site.

+1  A: 

You might want to try n-gram-based detection.

TextCat DEMO (LGPL) seems to work pretty well (it recognizes almost 70 languages). There is a Python port by Thomas Mangin here, using the same corpus.

Edit: TextCat competitors page provides some interesting links too.

Edit2: I wonder if making a Python wrapper for http://www.mnogosearch.org/guesser/ would be difficult...

wuub