Hi
Here is my requirement. I want to tokenize and tag a paragraph so that the following are achieved:
- Dates and times in the paragraph should be identified and tagged as DATE and TIME.
- Known phrases in the paragraph should be identified and tagged as CUSTOM.
- The remaining content should be tokenized and tagged by NLTK's default word_tokenize and pos_tag functions.
For example, the following sentence
"They all like to go there on 5th November 2010, but I am not interested."
should be tokenized and tagged as follows, given that the custom phrase is "I am not interested":
[('They', 'PRP'), ('all', 'VBP'), ('like', 'IN'), ('to', 'TO'), ('go', 'VB'),
('there', 'RB'), ('on', 'IN'), ('5th November 2010', 'DATE'), (',', ','),
('but', 'CC'), ('I am not interested', 'CUSTOM'), ('.', '.')]
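
One direction I can imagine, as a rough sketch only: find the DATE, TIME and CUSTOM spans with a regular expression first, then hand the text between those spans to the default NLTK functions. Everything named here (tag_paragraph, default_tagger, DATE_RE, TIME_RE) is made up for illustration, and the date and time patterns are deliberately simplistic assumptions that only cover forms like "5th November 2010" and "10:30 pm".

```python
import re

# Assumed, simplified patterns; real-world date and time formats vary far more.
MONTHS = (r"(?:January|February|March|April|May|June|July|August|"
          r"September|October|November|December)")
DATE_RE = r"\b\d{1,2}(?:st|nd|rd|th)?\s+" + MONTHS + r"\s+\d{4}\b"
TIME_RE = r"\b\d{1,2}:\d{2}(?::\d{2})?(?:\s?[AaPp][Mm])?\b"


def default_tagger(text):
    """Fallback: NLTK's word_tokenize + pos_tag (requires NLTK and its data)."""
    import nltk  # imported lazily so a different fallback can be plugged in
    return nltk.pos_tag(nltk.word_tokenize(text))


def tag_paragraph(text, phrases, fallback=default_tagger):
    """Tag dates, times and known phrases; delegate the rest to `fallback`."""
    alternatives = ["(?P<DATE>" + DATE_RE + ")", "(?P<TIME>" + TIME_RE + ")"]
    if phrases:
        # Longest phrases first, so a longer phrase wins over its own prefix.
        ordered = sorted(phrases, key=len, reverse=True)
        custom = "|".join(re.escape(p) for p in ordered)
        # Lookarounds keep a phrase from matching inside a longer word.
        alternatives.append(r"(?P<CUSTOM>(?<!\w)(?:" + custom + r")(?!\w))")
    pattern = re.compile("|".join(alternatives))

    tagged, pos = [], 0
    for match in pattern.finditer(text):
        gap = text[pos:match.start()]
        if gap.strip():
            tagged.extend(fallback(gap))  # ordinary text before the span
        # lastgroup is the name of the alternative that matched.
        tagged.append((match.group(), match.lastgroup))
        pos = match.end()
    tail = text[pos:]
    if tail.strip():
        tagged.extend(fallback(tail))  # ordinary text after the last span
    return tagged
```

It would be called as tag_paragraph(sentence, ["I am not interested"]). One drawback I can already see: the POS tagger only ever sees fragments of the sentence, so the tags for the ordinary words may differ from what pos_tag gives for the whole sentence.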
Any suggestions would be useful.