Hey guys,

I've tried the WordNet lemmatizer, but I found that some common words like 'studying' or 'waiting' are not processed appropriately: they come back unchanged.

Am I missing something?

A: 

As you can see in the online WordNet, `studying` and `waiting` are also nouns (as well as gerunds of verbs), so it's not surprising that they get lemmatized as themselves.

If that's unsatisfactory, you need to find a more "aggressive" lemmatizer (one that deliberately ignores perfectly correct but "less likely" interpretations of a word), or, if you can first perform part-of-speech tagging based on whole sentences, use a lemmatizer that can be told whether e.g. a given instance of `studying` is a verb rather than a noun.
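To make the second option concrete, here is a minimal sketch, assuming NLTK is the library in use (the question doesn't say) and that its `wordnet`, tokenizer, and tagger data have been downloaded. The helper names `penn_to_wordnet`, `lemmatize_tagged`, and `nltk_lemmatize_sentence` are made up for illustration; only `nltk.word_tokenize`, `nltk.pos_tag`, and `WordNetLemmatizer.lemmatize(word, pos)` are NLTK calls.

```python
# Sketch only: POS-aware lemmatization, assuming NLTK plus its WordNet,
# tokenizer and tagger data. All helper names here are hypothetical.

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter ('n', 'v', 'a', 'r')."""
    if tag.startswith("V"):
        return "v"  # verbs: VB, VBD, VBG, ...
    if tag.startswith("J"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # everything else is treated as a noun


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, penn_tag) pairs with a lemmatize(word, pos) callable."""
    return [lemmatize(word.lower(), penn_to_wordnet(tag))
            for word, tag in tagged_tokens]


def nltk_lemmatize_sentence(sentence):
    """Tokenize, POS-tag and lemmatize one sentence using NLTK."""
    # Imported here so the helpers above stay usable without NLTK installed.
    import nltk
    from nltk.stem import WordNetLemmatizer

    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return lemmatize_tagged(tagged, WordNetLemmatizer().lemmatize)
```

With this, an instance of `studying` that the tagger marks as a verb (e.g. `VBG`) is passed to the lemmatizer with `pos="v"` and should come back as `study`, while an instance tagged as a noun stays `studying`. How well that works depends on how reliable the tagger is on your text.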

Alex Martelli
Hmm, is it more sensible to use a more aggressive one, as you mentioned, like the Porter stemmer, or to do POS tagging first? I'm worried about performance because there are quite a number of chunks of text I need to handle.
goh
@goh, POS-tagging is not fast, but it IS going to be more accurate -- you probably don't want to see the stem "awn" for "awning", I suspect. But will you always have the words in the context of a well-formed sentence, or do you need to deal with them in isolation sometimes? If the latter, then the aggressive stemmer is what's left... :-(
Alex Martelli
@Alex, actually I'm doing classification on blogs. I need to infer from a blog's content whether its author is from my school. I have a couple of blogs I can start crawling from; the rest would be classified. I'm doing HTML stripping, then word tokenising, followed by POS tagging, filtering out everything except the nouns, and lemmatising them. The features for the classifier would be the nouns, I guess. Is that a good approach?
goh
@goh, worth a try (the problem is, after all, **extremely** hard) -- but if you're POS tagging anyway to get the nouns, then -- why is keeping e.g. the noun `waiting` as its own stem (which as a noun it is) at all a problem?!
Alex Martelli