I am using NLTK to extract nouns from a text-string starting with the following command: tagged_text = nltk.pos_tag(nltk.Text(nltk.word_tokenize(some_string)))

It works fine in English. Is there an easy way to make it work for German as well? (I have no experience with natural language programming, but I managed to use the python nltk library which is great so far.)

thx for any hints.

+5  A: 

Natural language software does its magic by leveraging corpora and the statistics they provide. You'll need to tell nltk about some German corpus to help it tokenize German correctly. I believe the EUROPARL corpus might help get you going.

See nltk.corpus.europarl.german - this is what you're looking for.

Also, consider tagging this question with "nlp".

Mike Atlas
+1 for beating me to it ;-), also thanks for the hint about tagging the question itself.
mjv
Thx. I got the German file from the EUROPARL corpus with your help and another useful hint: http://code.google.com/p/nltk/issues/detail?id=415 On to training the tokenizer. Johannes
Johannes Meier
+3  A: 

Part-of-Speech (POS) tagging is very specific to a particular [natural] language. NLTK includes many different taggers, which use distinct techniques to infer the tag of a given token in a given context. Most (but not all) of these taggers use a statistical model of some sort as the main or sole device to "do the trick". Such taggers require "training data" from which to build this statistical representation of the language, and the training data comes in the form of corpora.
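To make the "statistical model built from training data" idea concrete, here is a minimal pure-Python sketch of the simplest such model: for each word, remember the tag it carried most often in the training data. The three training sentences are made up for illustration, the tag names follow the German STTS convention (NN = common noun), and the function names are hypothetical, not part of NLTK.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged sample (made-up data; tags follow the STTS convention).
TRAINING_SENTENCES = [
    [("Der", "ART"), ("Hund", "NN"), ("bellt", "VVFIN")],
    [("Die", "ART"), ("Katze", "NN"), ("rennt", "VVFIN")],
    [("Der", "ART"), ("Hund", "NN"), ("rennt", "VVFIN")],
]


def train_unigram_model(tagged_sentences):
    """Map each word to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}


def tag_tokens(tokens, model, default_tag=None):
    """Look up each token; unseen words get default_tag."""
    return [(token, model.get(token, default_tag)) for token in tokens]


model = train_unigram_model(TRAINING_SENTENCES)
tagged = tag_tokens(["Die", "Katze", "bellt", "laut"], model)
nouns = [word for word, tag in tagged if tag == "NN"]
```

Words that never appeared in the training data ("laut" here) stay untagged, which is exactly why the size and quality of the corpus matter so much.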

The NLTK "distribution" itself includes many of these corpora, as well as a set of "corpus readers" which provide an API to read different types of corpora. I don't know the state of affairs in NLTK proper, or whether it includes any German corpus. You can, however, locate some free corpora, which you'll then need to convert to a format that satisfies the appropriate NLTK corpus reader, and then you can use this to train a POS tagger for the German language.
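As a rough sketch of that last step, assuming you already have German text annotated as "word/TAG" tokens (the three lines below are made up and use STTS-style tags; a real corpus would be far larger and may need its own conversion), you can turn it into tagged sentences with nltk.tag.str2tuple and train one of NLTK's simpler taggers, UnigramTagger, on them:

```python
import nltk

# Made-up sample in "word/TAG" form; a real corpus would be much larger.
RAW_LINES = [
    "Der/ART Hund/NN bellt/VVFIN",
    "Die/ART Katze/NN rennt/VVFIN",
    "Berlin/NE ist/VAFIN eine/ART Stadt/NN",
]

# Convert each line into a list of (word, tag) tuples, the format
# NLTK taggers expect as training data.
train_sents = [
    [nltk.tag.str2tuple(token) for token in line.split()]
    for line in RAW_LINES
]

# Train a unigram tagger: it assigns each known word its most frequent tag.
# Without a backoff tagger, unknown words are tagged None.
tagger = nltk.UnigramTagger(train_sents)

tagged = tagger.tag("Die Katze ist eine Maus".split())

# Keep common nouns (NN) and proper nouns (NE).
nouns = [word for word, tag in tagged if tag in ("NN", "NE")]
```

With so little training data most real words will come back untagged; this only shows the shape of the workflow, not a usable German tagger.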

You can even create your own corpus, but that is a hell of a painstaking job; if you work in a university, you gotta find ways of bribing and otherwise coercing students to do that for you ;-)

mjv