ansaurus

Question

How to determine the (natural) language of a document?

Answer 1

A:

Isn't the problem several orders of magnitude easier if you've only got two languages (English and German) to choose from? In this case your approach of a list of stop words might be good enough.

Obviously you'd need to consider a rewrite if you added more languages to your list.

ChrisF 2009-09-05 14:55:24

Answer 2

+2 A:

The stop words approach for the two languages is quick and would be made quicker by heavily weighting ones that don't occur in the other language "das" in German and "the" in English, for example. The use of the "exclusive words" would help extend this approach robustly over a larger group of languages as well.

Pinochle 2009-09-05 14:59:08

Good idea to weight the exclusive words, I think I'll experitment with that.

Robert Petermeier 2009-09-06 13:09:46

Answer 3

+3 A:

I believe the standard procedure is to measure the quality of a proposed algorithm with test data (i.e. with a corpus). Define the percentage of correct analysis that you would like the algorithm to achieve, and then run it over a number of documents which you have manually classified.

As for the specific algorithm: using a list of stop words sounds fine. Another approach that has been reported to work is to use a Bayesian Filter, e.g. SpamBayes. Rather than training it into ham and spam, train it into English and German. Use a portion of your corpus, run that through spambayes, and then test it on the complete data.

Martin v. Löwis 2009-09-05 14:59:11

Thanks for that, using a Bayesian filter is an interesting idea. Unfortunately SpamBayes is in Python which I can't use, plus I don't want to have to train the app. That's why I came up with the idea of using stop words: the statistical work has been done already and is contained in the list.

Robert Petermeier 2009-09-06 13:35:30

Stop words won't work if in an English text a German phrase is cited.

Pavel Shved 2009-09-06 15:41:54

@Robert Petermeier, you're likely going to have to do *some* training. Static algorithms are going to be simply bad. Pre-trained dynamic algorithms will be better, but will still be bad since they won't be trained on your domain (e.g. your specific type of documents). There's no statistical work that's "been done already" that will work for everyone in all scenarios.

Chris S 2010-04-03 03:45:17

Answer 4

+5 A:

Try measure occurences of each letter in text. For English and German texts are calculated the frequencies and, maybe, the distributions of them. Having obtained these data, you may reason what language the distribution of frequencies for your text belongs.

You should use Bayesian inference to determine the closest language (with a certain error probability) or, maybe, there are other statistical methods for such tasks.

Pavel Shved 2009-09-05 15:13:47

I happen to know someone who found that short (3-5) sequences of letters worked *very* well for this.

BCS 2009-09-08 06:41:01

Answer 5

+1 A:

First things first, you should set up a test of your current solution and see if it reaches your desired level of accuracy. Success in your specific domain matters more than following a standard procedure.

If your method needs improving, try weighting your stop words by the rarity in a large corpus of English and German. Or you could use a more complicated technique like training a Markov model or Bayesian classifier. You could expand any of the algorithms to look at higher-order n-grams (for example, two or three word sequences) or other features in the text.

jlc 2009-09-05 15:22:59

Answer 6

+4 A:

English and German use the same set of letters except for ä, ö, ü and ß (eszett). You can look for those letters for determining the language.

You can also look at this text (Comparing two language identification schemes) from Grefenstette. It looks at letter trigrams and short words. Common trigrams for german en_, er_, _de. Common trigrams for English the_, he_, the...

There’s also Bob Carpenter’s How does LingPipe Perform Language ID?

anno 2009-09-05 21:49:02

Thanks for the two links, both are very interesting. I think the LingPipe one addresses a problem of Grefenstette's approaches:"Character-level models are particularly well-suited to language ID because they do not require tokenized input; tokenizers are often language-specific."

Robert Petermeier 2009-09-06 13:25:56

Answer 7

+3 A:

The problem with using a list of stop words is one of robustness. Stop word lists are basically a set of rules, one rule per word. Rule-based methods tend to be less robust to unseen data than statistical methods. Some problems you will encounter are documents that contain equal counts of stop words from each language, documents that have no stop words, documents that have stop words from the wrong language, etc. Rule-based methods can't do anything their rules don't specify.

One approach that doesn't require you to implement Naive Bayes or any other complicated math or machine learning algorithm yourself, is to count character bigrams and trigrams (depending on whether you have a lot or a little of data to start with -- bigrams will work with less training data). Run the counts on a handful of documents (the more the better) of known source language and then construct an ordered list for each language by the number of counts. For example, English would have "th" as the most common bigram. With your ordered lists in hand, count the bigrams in a document you wish to classify and put them in order. Then go through each one and compare its location in the sorted unknown document list to the its rank in each of the training lists. Give each bigram a score for each language as

1 / ABS(RankInUnknown - RankInLanguage + 1).

Whichever language ends up with the highest score is the winner. It's simple, doesn't require a lot of coding, and doesn't require a lot of training data. Even better, you can keep adding data to it as you go on and it will improve. Plus, you don't have to hand-create a list of stop words and it won't fail just because there are no stop words in a document.

It will still be confused by documents that contain equal symmetrical bigram counts. If you can get enough training data, using trigrams will make this less likely. But using trigrams means you also need the unknown document to be longer. Really short documents may require you to drop down to single character (unigram) counts.

All this said, you're going to have errors. There's no silver bullet. Combining methods and choosing the language that maximizes your confidence in each method may be the smartest thing to do.

ealdent 2009-09-07 00:29:18

Thanks for that. By the way, hya linked to a paper that contains the most common trigrams for several languages so I could reuse that (or find such a list for bigrams) and wouldn't have to compute RankInLanguage.

Robert Petermeier 2009-09-07 17:52:43

Interesting, I just found out that this problem and the n-gram solution is actually a students' exercise: http://www.umiacs.umd.edu/~resnik/cl2001/assignments/4/4a.html

Robert Petermeier 2009-09-07 17:54:24

Cool. And there's a Python implementation by Damir Cavar at Indiana: http://ling.unizd.hr/~dcavar/LID/, also with data for a few languages.

ealdent 2009-09-08 12:41:35

A ruby gem also exists: http://github.com/feedbackmine/language_detector

ealdent 2009-09-17 18:16:41

Answer 8

A:

You can use the Google Language Detection API.

Here is a little program that uses it:

baseUrl = "http://ajax.googleapis.com/ajax/services/language/detect"

def detect(text):
    import json,urllib
    """Returns the W3C language code of a natural language"""

    params = urllib.urlencode({'v': '1.0' , "q":text[0:3000]}) # only use first 3000 characters                    
    resp = json.load(urllib.urlopen(baseUrl + "?" + params))
    try:
        retText = resp['responseData']['language']
    except:
        raise
    return retText


def test():
    print "Type some text to detect its language:"
    while True:
        text = raw_input('#>  ')
        retText = detect(text)
        print retText


if __name__=='__main__':
    import sys
    try:
        test()
    except KeyboardInterrupt:
        print "\n"
        sys.exit(0)

Other useful references:

Google Announces APIs (and demo): http://googleblog.blogspot.com/2008/03/new-google-ajax-language-api-tools-for.html

Python wrapper: http://code.activestate.com/recipes/576890-python-wrapper-for-google-ajax-language-api/

Another python script: http://www.halotis.com/2009/09/15/google-translate-api-python-script/

RFC 1766 defines W3C languages

Get the current language codes from: http://www.iana.org/assignments/language-subtag-registry

vy32 2010-04-02 05:29:31

Answer 9

A:

Language detection is not very difficult conceptually. Please look at my reply to a related question and other replies to the same question.

In case you want to take a shot at writing it yourself, you should be able to write a naive detector in half a day. We use something similar to the following algorithm at work and it works surprisingly well. Also read the python implementation tutorial in the post I linked.

Steps:

Take two corpora for the two languages and extract character level bigrams, trigrams and whitespace-delimited tokens (words). Keep a track of their frequencies. This step builds your "Language Model" for both languages.
Given a piece of text, identify the char bigrams, trigrams and whitespace-delimited tokens and their corresponding "relative frequencies" for each corpus. If a particular "feature" (char bigram/trigram or token) is missing from your model, treat its "raw count" as 1 and use it to calculate its "relative frequency".
The product of the relative frequencies for a particular language gives the "score" for the language. This is a very-naive approximation of the probability that the sentence belongs to that language.
The higher scoring language wins.

Note 1: We treat the "raw count" as 1 for features that do not occur in our language model. This is because, in reality, that feature would have a very small value but since we have a finite corpus, we may not have encountered it yet. If you take it's count to be zero, then your entire product would also be zero. To avoid this, we assume that it's occurence is 1 in our corpus. This is called add-one smoothing. There are other advance smoothing techniques.

Note 2: Since you will be multiplying large number of fractions, you can easily run to zero. To avoid this, you can work in a logarithmic space and use this equation to calculate your score.

                a X b =  exp(log(a)+log(b))

Note 3: The algorithm I described is a "very-naive" version of the "Naive Bayes Algorithm".

hashable 2010-04-03 03:38:00

Answer 10

A:

If you're looking to flex your programming muscles trying to solve the problem yourself, I encourage you to; however, the wheel exists if you would like you use it.

Windows 7 ships with this functionality built in. A component called "Extended Linguistic Services" (ELS) has the ability to detect scripts and natural languages, and it's in the box, on any Windows 7 or Windows Server 2008 machine. Depending on whether you have any such machines available and what you mean when you say "free," that will do it for you. In any case, this is an alternative to Google or the other vendors mentioned here.

http://msdn.microsoft.com/en-us/library/dd317700(v=VS.85).aspx

And if you want to access this from .NET, there's some information on that here:

http://windowsteamblog.com/blogs/developers/archive/2009/05/18/windows-7-managed-code-apis.aspx

Hope that helps.

Chris Burrows 2010-04-03 04:37:05

ansaurus

tags:

views:

answers:

How to determine the (natural) language of a document?

related questions