Hi

Here is my requirement. I want to tokenize and tag a paragraph so that I can achieve the following:

  • Dates and times in the paragraph should be identified and tagged as DATE and TIME.
  • Known phrases in the paragraph should be identified and tagged as CUSTOM.
  • The rest of the content should be tokenized and tagged by NLTK's default word_tokenize and pos_tag functions.

For example, the following sentence

"They all like to go there on 5th November 2010, but I am not interested."

should be tokenized and tagged as follows when the custom phrase is "I am not interested":

[('They', 'PRP'), ('all', 'VBP'), ('like', 'IN'), ('to', 'TO'), ('go', 'VB'), 
('there', 'RB'), ('on', 'IN'), ('5th November 2010', 'DATE'), (',', ','), 
('but', 'CC'), ('I am not interested', 'CUSTOM'), ('.', '.')]

Any suggestions would be useful.

+1  A: 

The proper answer is to compile a large dataset tagged the way you want, then train a machine-learned chunker on it. If that's too time-consuming, the easy way is to run the POS tagger and post-process its output using regular expressions. Getting the longest match is the hard part here:

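import re
from nltk import pos_tag, word_tokenize

s = "They all like to go there on 5th November 2010, but I am not interested."

# The month alternation is elided here; list every month name in full.
# The month and year groups are optional and nested so that every prefix of a
# date ("5th", "5th November", "5th November 2010") matches while tokens are
# being accumulated one at a time.
DATE = re.compile(r'^[1-9][0-9]?(th|st|nd|rd)?( (January|...)( [12][0-9][0-9][0-9])?)?$')

def custom_tagger(sentence):
    """Yield (token, tag) pairs, merging consecutive tokens that form a date."""
    tagged = pos_tag(word_tokenize(sentence))
    phrase = []           # tokens collected for the current candidate date
    date_found = False    # True once phrase has matched DATE

    i = 0
    while i < len(tagged):
        w, t = tagged[i]
        phrase.append(w)
        in_date = DATE.match(' '.join(phrase))
        date_found |= bool(in_date)
        if date_found and not in_date:             # token after the date: emit the date
            yield (' '.join(phrase[:-1]), 'DATE')
            phrase = []                            # tagged[i] is re-examined next iteration
            date_found = False
        elif date_found and i == len(tagged) - 1:  # date runs to the end of the sentence
            yield (' '.join(phrase), 'DATE')
            return
        else:
            i += 1
            if not in_date:                        # ordinary token: pass it through
                yield (w, t)
                phrase = []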

Todo: expand the DATE re, insert code to search for CUSTOM phrases, make this more sophisticated by matching POS tags as well as tokens and decide whether 5th on its own should count as a date. (Probably not, so filter out dates of length one that only contain an ordinal number.)
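
As a rough sketch of the remaining todo items, the post-processing can also be done by trying the longest candidate span at each position, which avoids tracking partial matches. Everything here is an assumption for illustration: the helper name merge_spans, the full month list, and the hand-written tags standing in for pos_tag output. Custom phrases are split on whitespace, so they only match if word_tokenize splits them the same way.

```python
import re

# Assumption: this month list and pattern are illustrative, not the answer's regex.
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
# The month is mandatory, so a lone ordinal such as "5th" is not a date.
DATE = re.compile(r"[1-9][0-9]?(st|nd|rd|th)? (%s)( [12][0-9]{3})?" % MONTHS)


def merge_spans(tagged, custom_phrases):
    """Merge the longest CUSTOM or DATE span that starts at each token.

    tagged is a list of (token, tag) pairs, e.g. the output of pos_tag.
    """
    custom = {tuple(p.split()) for p in custom_phrases}
    # A date spans at most three tokens; custom phrases may be longer.
    longest = max([3] + [len(p) for p in custom])
    merged, i = [], 0
    while i < len(tagged):
        step, item = 1, tagged[i]  # default: keep the token unchanged
        for n in range(min(longest, len(tagged) - i), 0, -1):
            words = tuple(w for w, _ in tagged[i:i + n])
            text = " ".join(words)
            if words in custom:
                step, item = n, (text, "CUSTOM")
                break
            if DATE.fullmatch(text):
                step, item = n, (text, "DATE")
                break
        merged.append(item)
        i += step
    return merged


# Hand-written tags for the example sentence (assumed, not real tagger output).
tagged = [("They", "PRP"), ("all", "DT"), ("like", "VBP"), ("to", "TO"),
          ("go", "VB"), ("there", "RB"), ("on", "IN"), ("5th", "CD"),
          ("November", "NNP"), ("2010", "CD"), (",", ","), ("but", "CC"),
          ("I", "PRP"), ("am", "VBP"), ("not", "RB"), ("interested", "JJ"),
          (".", ".")]
result = merge_spans(tagged, ["I am not interested"])
```

Because spans are tried from longest to shortest, "5th November 2010" wins over "5th November", and a bare "5th" is passed through with its original tag.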

larsmans
Thanks for sharing code, let me try this, I will get back to you...
Software Enthusiastic
+2  A: 

You should probably use nltk.RegexpParser to chunk the tagged tokens to achieve your objective.

Reference: http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html#code-chunker1
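
A minimal sketch of that approach, assuming NLTK 3 and hand-written tags in place of real pos_tag output (the tagger may label "5th" differently). The grammar and the flatten helper are illustrative and not taken from the NLTK book. Note that a grammar over tags alone cannot check that the NNP is actually a month name, so it will over-match.

```python
from nltk import RegexpParser, Tree

# Assumed tags; in practice these would come from pos_tag(word_tokenize(...)).
tagged = [("on", "IN"), ("5th", "CD"), ("November", "NNP"),
          ("2010", "CD"), (",", ",")]

# Illustrative grammar: a number (or adjective), a proper noun, then a number.
chunker = RegexpParser("DATE: {<CD|JJ><NNP><CD>}")
tree = chunker.parse(tagged)


def flatten(tree):
    """Turn each chunk subtree into one (phrase, label) pair."""
    out = []
    for node in tree:
        if isinstance(node, Tree):
            phrase = " ".join(word for word, _ in node.leaves())
            out.append((phrase, node.label()))
        else:
            out.append(node)  # unchunked (token, tag) pair
    return out


flat = flatten(tree)
```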

Neodawn
Let me go through it, I will get back to you...
Software Enthusiastic