Named Entity Extraction (extract people, cities, organizations)
Content Tagging (extract topic tags by scanning the doc)
Structured Data Extraction
Topic Categorization (taxonomy classification by scanning the doc, e.g. Bayesian)
Text Extraction (HTML page cleaning)

Are there libraries that I can use to do any of the above NLP functions?

I don't really feel like forking out cash to AlchemyAPI.

+6  A: 

There are actually plenty of freely available open-source natural language processing packages out there. Here's a brief list, organized by what language the toolkit is implemented in:

If you're uncertain which one to go with, I would recommend starting with NLTK. The package is reasonably easy to use and has great documentation online, including a free book.

You should be able to use NLTK to easily accomplish the NLP tasks you've listed, e.g. named entity recognition (NER), extracting tags for documents, and document categorization.
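To make the document categorization idea concrete, here is a minimal sketch of a Bayesian topic classifier using only the Python standard library. It is a hand-rolled multinomial Naive Bayes with add-one smoothing, not NLTK's implementation (NLTK ships its own as nltk.NaiveBayesClassifier). The class name, the tokenizer, and the four training documents are all made up for illustration; a real categorizer would need far more training data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesCategorizer:
    """Hypothetical multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # documents seen per category
        self.word_counts = defaultdict(Counter)  # word frequencies per category
        self.vocabulary = set()                  # every word seen in training

    def train(self, labeled_docs):
        """Accumulate counts from (text, category) pairs."""
        for text, category in labeled_docs:
            self.doc_counts[category] += 1
            for word in tokenize(text):
                self.word_counts[category][word] += 1
                self.vocabulary.add(word)

    def classify(self, text):
        """Return the category with the highest log posterior score."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        words = tokenize(text)
        best_category, best_score = None, float("-inf")
        for category, n_docs in self.doc_counts.items():
            # Log prior: share of training documents in this category.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[category].values())
            for word in words:
                # Smoothed log likelihood; unseen words get a count of 0 + 1.
                count = self.word_counts[category][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Toy training data, invented purely for this example.
TRAINING_DOCS = [
    ("the team won the match with a late goal", "sports"),
    ("the striker scored twice in the final game", "sports"),
    ("the central bank raised interest rates again", "finance"),
    ("stocks fell as investors sold bank shares", "finance"),
]

categorizer = NaiveBayesCategorizer()
categorizer.train(TRAINING_DOCS)
print(categorizer.classify("the goalkeeper saved a goal in the game"))
```

Working in log space avoids underflow when many small probabilities are multiplied, and the add-one smoothing keeps a single unseen word from zeroing out a category.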

What the Alchemy people call structured data extraction looks like it's just HTML scraping that is robust against changes to the underlying HTML, as long as the page still renders the same way visually. So it's not really an NLP task.

For the extraction of text from HTML, just use boilerpipe. It's fast, good, and free.
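For a rough idea of what the simplest form of HTML page cleaning involves, here is a minimal sketch using Python's standard html.parser module. It only strips tags and drops script and style content; it is not boilerpipe's algorithm and does no boilerplate detection (navigation, ads, footers), which is the part boilerpipe is actually good at. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []     # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the text content of an HTML string as one space-joined line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


SAMPLE_PAGE = (
    "<html><head><title>Demo</title><style>p {color: red;}</style></head>"
    "<body><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>"
)
print(extract_text(SAMPLE_PAGE))  # prints "Demo Hello world"
```

This is fine for quick experiments, but on real pages it will happily return menus and footers along with the article text.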

dmcer
omg this is the answer i was looking for ! YOU SIR ARE AWESOME
wefwgeweg
If the task at hand is boilerpipe, there's no need to finish an argument about training data.
bmargulies
A: 

The Apache UIMA project was originally created by IBM and provides an NLP framework much like GATE. There are various annotators out there that are built for UIMA.

Thien