Hello,

I am new to Natural Language Processing and want to learn more by building a simple project. NLTK was recommended to me as a popular NLP library, so I plan to use it.

Here is what I would like to do:

  • I want to scan our company's intranet pages (approximately 3,000 pages)
  • I would like to parse and categorize the content of these pages by category, such as HR, Engineering, Corporate Pages, etc.

From what I have read so far, I can do this with Named Entity Recognition. I can describe entities for each category of pages, train an NLTK model, and run each page through it to determine the category.

Is this the right approach? I appreciate any direction and ideas...

Thanks

+1  A: 

It looks like you want to do text (document) classification, which is not quite the same as Named Entity Recognition. The goal of NER is to recognize named entities (proper names, places, institutions, etc.) in text. However, proper names can be very good features for text classification in a limited domain: for example, a page containing the name of the head engineer could likely be classified as Engineering.

johanbev
What if I want to classify, say, "Engineering" pages in more depth, like "Structural Engineering" or "Electrical Engineering"? Then I would have to recognize some regular expression patterns for each engineering discipline. Your example is also a very good one. If no regular expression matches a particular discipline, maybe the name of an engineer (belonging to a known discipline) in the text could indicate the discipline. Would NER help to achieve this?
developer
Generally you would train some sort of vector-based model, usually with tf-idf weighting. This is not very difficult in practice or in theory, and it often gives very good results. More advanced methods exist if this is not enough. I don't think NER is of much use here, and neither is writing regexes yourself to categorize the documents. That would most probably be a lot of work, especially if you want fine-grained categories, and you would have to invent some sort of confidence score on your own for the more difficult docs.
johanbev
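To make the tf-idf suggestion a bit more concrete, here is a minimal sketch in plain Python (standard library only, not NLTK). It is one possible way to do it, not the only one: each category is represented by the sum of the tf-idf vectors of its training pages, and a new page gets the category whose vector is closest by cosine similarity. The training texts, the category labels, and the helper names (`tokenize`, `train`, `classify`) are made up for illustration; real pages would need proper text extraction and far more training examples.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vector(tokens, idf):
    # Term frequency times idf, L2-normalised; terms unseen in training are dropped.
    counts = Counter(tokens)
    vec = {t: c * idf[t] for t, c in counts.items() if t in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def train(labeled_docs):
    # labeled_docs: list of (text, category) pairs.
    tokenized = [(tokenize(text), label) for text, label in labeled_docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens, _ in tokenized:
        doc_freq.update(set(tokens))
    # Smoothed idf so that terms present in every document keep a small weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    # One summed tf-idf vector ("centroid") per category.
    centroids = defaultdict(Counter)
    for tokens, label in tokenized:
        for term, weight in tfidf_vector(tokens, idf).items():
            centroids[label][term] += weight
    return idf, dict(centroids)


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, idf, centroids):
    # Returns (category, similarity); (None, 0.0) if no known term occurs.
    vec = tfidf_vector(tokenize(text), idf)
    if not vec:
        return None, 0.0
    scores = {label: cosine(vec, c) for label, c in centroids.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy training data, invented for this example.
TRAINING = [
    ("vacation policy payroll benefits and employee onboarding", "HR"),
    ("performance review payroll and benefits enrollment", "HR"),
    ("circuit design voltage load and wiring diagrams", "Engineering"),
    ("structural beam load calculations and steel design", "Engineering"),
    ("quarterly earnings press release and investor news", "Corporate"),
    ("company mission press contacts and board news", "Corporate"),
]

if __name__ == "__main__":
    idf, centroids = train(TRAINING)
    print(classify("how to enroll in benefits and view payroll", idf, centroids))
    print(classify("steel beam wiring voltage", idf, centroids))
```

The similarity returned next to the label can serve as a rough confidence score, so pages that score low for every category can be set aside for manual review. Finer-grained categories such as "Structural Engineering" would just be additional labels in the training data.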
I see your point. How do I do this with NLTK, and how do I get started? Can you point me in the right direction? Your help is much appreciated.
developer
If you haven't found it yet, the NLTK book ("Natural Language Processing with Python") is a good start for all things NLTK: http://www.nltk.org/book It is free to read online as well.
winwaed