I'm looking for a library that can perform morphological analysis on German words, i.e. one that converts any word into its root form and provides meta information about the analysed word.

For example:

gegessen -> essen
wurde [...] gefasst -> fassen
Häuser -> Haus
Hunde -> Hund

My wishlist:

  • It has to work with both nouns and verbs.
  • I'm aware that this is a very hard task given the complexity of the German language, so I'm also looking for libraries which provide only approximations or may be only 80% accurate.
  • I'd prefer libraries which don't work with dictionaries, but again I'm open to compromise given the circumstances.
  • I'd also prefer C/C++/Delphi Windows libraries, because that would make them easier to integrate, but .NET, Java, ... will also do.
  • It has to be a free library. (L)GPL, MPL, ...

EDIT: I'm aware that there is no way to perform a morphological analysis without any dictionary at all, because of the irregular words. When I say I prefer a library without a dictionary, I mean those full-blown dictionaries which map each and every word form:

arbeite -> arbeiten
arbeitest -> arbeiten
arbeitet -> arbeiten
arbeitete -> arbeiten
arbeitetest -> arbeiten
arbeiteten -> arbeiten
arbeitetet -> arbeiten
gearbeitet -> arbeiten
...

Those dictionaries have several drawbacks, including the huge size and the inability to process unknown words.

Of course all exceptions can only be handled with a dictionary:

esse -> essen
isst -> essen
eßt -> essen
aß -> essen
aßt -> essen
aßen -> essen
...

(My mind is spinning right now :) )
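To make the idea concrete, here is a rough sketch of the hybrid I have in mind: a small hand-maintained exception table for irregular forms, plus a few ending rules for regular ones. The names `IRREGULAR`, `REGULAR_ENDINGS` and `lemmatize_verb` are made up for illustration, and the rules are an assumption that only covers the conjugation pattern of "arbeiten" shown above (stems ending in -t/-d). It is not a real analyser.

```python
# Hand-maintained exception table (assumption: only irregular forms go here).
IRREGULAR = {
    "esse": "essen", "isst": "essen", "eßt": "essen",
    "aß": "essen", "aßt": "essen", "aßen": "essen",
    "gegessen": "essen",
}

# Endings of the "arbeiten" pattern, longest first so that
# "etest" is tried before "est" or "e".
REGULAR_ENDINGS = ("etest", "eten", "etet", "ete", "est", "et", "en", "e")


def lemmatize_verb(word):
    """Return a guessed infinitive, or the word unchanged if no rule applies."""
    word = word.lower()
    # 1. Irregular forms are looked up, never derived.
    if word in IRREGULAR:
        return IRREGULAR[word]
    # 2. Regular past participle: "ge" + stem + "et", e.g. "gearbeitet".
    if word.startswith("ge") and word.endswith("et") and len(word) > 5:
        return word[2:-2] + "en"
    # 3. Strip a known ending and append the infinitive ending "en".
    for ending in REGULAR_ENDINGS:
        stem = word[: -len(ending)]
        if word.endswith(ending) and len(stem) >= 3:
            return stem + "en"
    return word


for form in ("arbeitest", "arbeitetet", "gearbeitet", "aßen"):
    print(form, "->", lemmatize_verb(form))
```

The table stays small because it only holds the exceptions, and unknown regular words still get a guess. The obvious weakness is that the rules over-apply: a form such as "machte" would come out wrong with this ending list, so a real library needs far more careful rules.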

+1  A: 

I don't think that this can be done without a dictionary.

Rules-based approaches will invariably trip over things like

gegessen -> essen
gegangen -> angen

(Note to people who don't speak German: the correct solution in the second case is "gehen".)
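A minimal sketch of the problem, using a made-up rule (`strip_participle_prefix` is purely hypothetical, not taken from any library) that happens to give the right answer for "gegessen":

```python
def strip_participle_prefix(word):
    """Hypothetical rule: drop a leading 'geg' from forms ending in 'en'."""
    if word.startswith("geg") and word.endswith("en"):
        return word[3:]
    return word


print(strip_participle_prefix("gegessen"))  # essen (correct)
print(strip_participle_prefix("gegangen"))  # angen (wrong, should be "gehen")
```

Both words look identical to the rule, so only a lookup of the irregular form can tell them apart.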

Svante
You are partially right, I updated my question.
DR
+1  A: 

Have a look at Leo. They offer the data which you are after, maybe it gives you some ideas.

weismat
+5  A: 

I think you are looking for a "stemming algorithm".

Martin Porter's approach is well known among linguists. The Porter stemmer is basically an affix stripping algorithm, combined with a few substitution rules for those special cases.

Most stemmers deliver stems that are linguistically "incorrect". For example: both "beautiful" and "beauty" can result in the stem "beauti", which, of course, is not a real word. This doesn't matter, though, if you're using those stems to improve search results in information retrieval systems. Lucene comes with support for the Porter stemmer, for instance.
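To illustrate the idea of suffix stripping plus substitution, here is a toy sketch. It is not Porter's actual rule set: the function name `toy_stem`, the suffix list and the length thresholds are my own assumptions, chosen only so that the "beautiful"/"beauty" example can be followed.

```python
def toy_stem(word):
    """Toy stemmer: one suffix-stripping pass plus one substitution rule."""
    word = word.lower()
    # Suffix stripping: remove the first matching suffix, provided
    # at least three characters of the word remain.
    for suffix in ("fulness", "ful", "ness"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Substitution rule: a final "y" preceded by a consonant becomes "i".
    if len(word) > 2 and word.endswith("y") and word[-2] not in "aeiou":
        word = word[:-1] + "i"
    return word


print(toy_stem("beautiful"))  # beauti
print(toy_stem("beauty"))     # beauti
```

Both words collapse to the same non-word "beauti", which is all a search index needs in order to treat them as related.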

Porter also devised a simple programming language for developing stemmers, called Snowball.

There are also stemmers for German available in Snowball. A C version, generated from the Snowball source, is also available on the website, along with a plain text explanation of the algorithm.

Here's the German stemmer in Snowball: http://snowball.tartarus.org/algorithms/german/stemmer.html

If you're looking for the corresponding stem of a word as you would find it in a dictionary, along with information on the part of speech, you should Google for "lemmatization".

gclj5