ansaurus

Question

Answer 1

+1 A:

The proper answer is to compile a large dataset tagged the way you want, then train a machine-learned chunker on it. If that's too time-consuming, the easy way is to run the POS tagger and post-process its output using regular expressions. Getting the longest match is the hard part here:

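import re
from nltk import pos_tag, word_tokenize

s = "They all like to go there on 5th November 2010, but I am not interested."

# '...' stands for the remaining month names; fill them in before running.
# The month and year are optional so that a prefix such as "5th" already
# matches and the phrase can keep growing token by token.
DATE = re.compile(r'^[1-9][0-9]?(th|st|nd|rd)?( (January|...)( [12][0-9][0-9][0-9])?)?$')

def custom_tagger(sentence):
    tagged = pos_tag(word_tokenize(sentence))
    phrase = []           # tokens of the candidate date collected so far
    date_found = False    # True once phrase has matched DATE at least once

    i = 0
    while i < len(tagged):
        (w, t) = tagged[i]
        phrase.append(w)
        in_date = DATE.match(' '.join(phrase))
        date_found |= bool(in_date)
        if date_found and not in_date:          # w ends the date; w is re-examined on the next pass
            yield (' '.join(phrase[:-1]), 'DATE')
            phrase = []
            date_found = False
        elif date_found and i == len(tagged)-1:    # date runs to the end of the sentence
            yield (' '.join(phrase), 'DATE')
            return
        else:
            i += 1
            if not in_date:                     # ordinary token: pass it through with its POS tag
                yield (w, t)
                phrase = []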

Todo: expand the DATE regular expression, insert code to search for CUSTOM phrases, make this more sophisticated by matching POS tags as well as tokens, and decide whether 5th on its own should count as a date. (Probably not, so filter out dates of length one that only contain an ordinal number.)
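As a rough illustration of two of those Todo items, here is a minimal sketch that spells out the month names and drops one-token "dates" such as a bare 5th. It works on a list of (word, tag) pairs, so it can be tried without a tagger. The names MONTHS and merge_dates are hypothetical, and the month and year are made optional in the pattern so that every prefix of a date matches while the phrase grows.

```python
import re

# Hypothetical helper names; not part of NLTK.
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Day, optional ordinal suffix, then optionally a month and a year.
DATE = re.compile(
    r"^[1-9][0-9]?(th|st|nd|rd)?( (%s)( [12][0-9]{3})?)?$" % MONTHS
)

def merge_dates(tagged):
    """Merge the longest run of tokens matching DATE into (phrase, 'DATE').

    `tagged` is a list of (word, tag) pairs, e.g. POS tagger output.
    A run of length one (a bare "5th" or "5") keeps its original tag.
    """
    out = []
    i = 0
    while i < len(tagged):
        # Extend the candidate for as long as the joined words still match.
        j = i
        while j < len(tagged) and DATE.match(
                " ".join(w for w, _ in tagged[i:j + 1])):
            j += 1
        if j - i >= 2:                  # at least a day and a month
            out.append((" ".join(w for w, _ in tagged[i:j]), "DATE"))
            i = j
        else:                           # no date, or a lone number/ordinal
            out.append(tagged[i])
            i += 1
    return out
```

Feeding it the tagged tokens of "on 5th November 2010, but" should give one DATE pair for the three date tokens and leave the rest untouched. Matching on POS tags as well as tokens is not covered here.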

larsmans 2010-10-14 13:33:53

Thanks for sharing code, let me try this, I will get back to you...

Software Enthusiastic 2010-10-16 05:28:36

Answer 2

+2 A:

You should probably do chunking with the nltk.RegexpParser to achieve your objective.

Reference: http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html#code-chunker1
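A minimal sketch of what that could look like, assuming a recent NLTK (where tree nodes expose label(); older releases used a .node attribute). The grammar and the hand-assigned tags below are illustrative assumptions, not taken from the question, and a real tagger may label 5th differently.

```python
import nltk

# Hypothetical grammar: a DATE chunk is a number or ordinal (CD or JJ),
# then a proper noun (the month), then an optional number (the year).
grammar = r"DATE: {<CD|JJ><NNP><CD>?}"
chunker = nltk.RegexpParser(grammar)

# Hand-tagged input with assumed tags, so no tagger model is needed here.
tagged = [("go", "VB"), ("there", "RB"), ("on", "IN"),
          ("5th", "JJ"), ("November", "NNP"), ("2010", "CD")]

tree = chunker.parse(tagged)

# Collect the words of every DATE chunk found in the sentence.
dates = [" ".join(word for word, tag in subtree.leaves())
         for subtree in tree.subtrees(lambda t: t.label() == "DATE")]
print(dates)
```

Because the rule looks only at POS tags, it will also chunk non-dates with the same tag sequence, so it probably needs to be combined with a check on the tokens themselves.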

Neodawn 2010-10-14 20:39:11

Let me go through it, I will get back to you...

Software Enthusiastic 2010-10-18 06:53:37


nltk custom tokenizer and tagger
