I'm looking for code, a product, or a service that does semantic analysis of text (sentences and/or paragraphs) to categorize the text by general topic, e.g.
- Finance
- Entertainment
- Technology
- Business
- Art
- etc...
Would this be of any help to you?
http://en.wikipedia.org/wiki/Document%5Fclassification
It's neither a finished product or service nor code, but it describes the various algorithms that can be used for semantic analysis. After Googling a bit further, I believe this isn't really out of the laboratory yet. People are mostly experimenting with k-nearest-neighbour (KNN) algorithms, which results in cool stuff, but not really what you need:
http://www.ebi.ac.uk/webservices/whatizit/info.jsf
But if there is some software that will do what you ask, it would be in this list:
http://www.kdnuggets.com/software/text.html
For example, the LPU program seems to be able to learn if you feed it enough training documents.
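To make the KNN idea mentioned above a bit more concrete, here is a minimal sketch in plain Python. It is not taken from any of the linked tools. The function names (`bag_of_words`, `cosine_similarity`, `knn_classify`) and the toy documents are my own assumptions for illustration, and a real system would need far more training text and better features.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphabetic word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(text, labelled_docs, k=3):
    """Return the majority topic among the k labelled documents most similar to text."""
    query = bag_of_words(text)
    scored = sorted(
        ((cosine_similarity(query, bag_of_words(doc)), topic)
         for doc, topic in labelled_docs),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(topic for _, topic in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical, hand-written examples; not real data.
LABELLED_DOCS = [
    ("bank raises interest rates on loans", "Finance"),
    ("stock market shares fall sharply", "Finance"),
    ("film wins award at festival", "Entertainment"),
    ("band releases album and tour dates", "Entertainment"),
    ("phone gets faster processor chip", "Technology"),
    ("software update fixes laptop bug", "Technology"),
]

print(knn_classify("interest rates and stock market", LABELLED_DOCS))
```

With so few documents the vote is fragile, so treat this only as an illustration of the mechanism: represent each text as word counts, find the nearest labelled neighbours, and let them vote.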
This isn't my area, but I once worked closely with a research unit that trains people in the use of such software and collaborates with commercial developers in this field. It might be worth dropping these people a line:
If you have a bunch of examples that have already been categorised, you can use these to train a classifier. This is a very simple document classification problem, and any suite of machine learning tools will have the algorithms and tutorials for this. For instance, check out weka: http://www.cs.waikato.ac.nz/ml/weka/
or rapidminer: http://rapid-i.com/content/blogcategory/38/69/
If your needs are limited, and you just want a simple API, you cannot go wrong with this Naive Bayes library: https://ci-bayes.dev.java.net/
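To give a feel for what training such a classifier involves, here is a rough sketch of multinomial Naive Bayes with add-one smoothing in plain Python. It is not the API of the library linked above or of weka/rapidminer. The class name `NaiveBayesTopicClassifier`, its methods, and the toy training sentences are all made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTopicClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # documents seen per topic
        self.word_counts = defaultdict(Counter)  # word frequencies per topic
        self.vocabulary = set()                  # all words seen in training

    def train(self, examples):
        """examples: iterable of (text, topic) pairs that are already categorised."""
        for text, topic in examples:
            self.doc_counts[topic] += 1
            for word in tokenize(text):
                self.word_counts[topic][word] += 1
                self.vocabulary.add(word)

    def classify(self, text):
        """Return the topic with the highest log-posterior score."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        words = tokenize(text)
        best_topic, best_score = None, -math.inf
        for topic, n_docs in self.doc_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[topic].values())
            for word in words:
                count = self.word_counts[topic][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_topic, best_score = topic, score
        return best_topic


# Toy training data, invented for illustration only.
TRAINING = [
    ("stocks fell as the bank raised interest rates", "Finance"),
    ("investors bought bonds and shares on the market", "Finance"),
    ("the new film and its soundtrack topped the charts", "Entertainment"),
    ("the actor won an award for the film", "Entertainment"),
    ("the new phone ships with a faster processor", "Technology"),
    ("developers released software for the laptop", "Technology"),
]

classifier = NaiveBayesTopicClassifier()
classifier.train(TRAINING)
print(classifier.classify("the bank cut interest rates and shares rose"))
```

In practice you would feed it many more categorised examples per topic, and probably drop very common words such as "the" before counting.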
Good luck!
If you're into Python/interpreted languages, check out the excellent NLTK framework at nltk.org. It has an excellent how-to page and a recently published O'Reilly book.
If you're into Java and/or need a more mature but harder-to-grasp framework, try GATE instead.