Hello,
I am new to Natural Language Processing and I want to learn more by creating a simple project. NLTK was suggested to be popular in NLP so I will use it in my project.
Here is what I would like to do:
- I want to scan our company's intranet pages; approximately 3K pages
- I would like to parse and categorize the content of these pages based on certain criteria such as: HR, Engineering, Corporate Pages, etc...
From what I have read so far, I can do this with Named Entity Recognition. I can describe entities for each category of pages, train the NLTK solution and run each page through to determine the category.
Is this the right approach? I appreciate any direction and ideas...
Thanks