I'm looking for code, a product, or a service that does semantic analysis of text (sentences and/or paragraphs) to categorize it by general topic, e.g.:

  • Finance
  • Entertainment
  • Technology
  • Business
  • Art
  • etc...
A: 

Would this be of any help to you?

http://en.wikipedia.org/wiki/Document%5Fclassification

It's neither a finished product or service nor code, but it describes the various algorithms that can be used for semantic analysis. After Googling a bit further, my impression is that this isn't really out of the laboratory yet. People seem to be experimenting mostly with k-nearest-neighbour (KNN) algorithms, with some cool results, but nothing that is really what you need:

http://www.ebi.ac.uk/webservices/whatizit/info.jsf

But if there is some software that will do what you ask, it would be in this list:

http://www.kdnuggets.com/software/text.html

For example, the LPU program seems to be able to learn categories if you feed it enough training documents.

http://www.cs.uic.edu/~liub/LPU/LPU-download.html

littlegreen
A: 

This isn't my area, but I once worked closely with a research unit that trains people in the use of such software and works closely with commercial developers in this field. It might be worth dropping these people a line:

http://caqdas.soc.surrey.ac.uk/

Purpletoucan
+2  A: 

If you have a bunch of examples that have already been categorised, you can use them to train a classifier. This is a very simple document classification problem, and any suite of machine learning tools will have algorithms and tutorials for it. For instance, check out weka: http://www.cs.waikato.ac.nz/ml/weka/

or rapidminer: http://rapid-i.com/content/blogcategory/38/69/

If your needs are limited, and you just want a simple API, you cannot go wrong with this Naive Bayes library: https://ci-bayes.dev.java.net/
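
To give a rough idea of what a Naive Bayes classifier does with already-categorised examples, here is a minimal sketch in plain Python (standard library only). It is not the API of the library linked above or of any other tool mentioned here: the class name, the method names and the tiny training set are all made up for illustration, and a real application would need far more training text per topic.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesTopicClassifier:
    """Hypothetical multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # labelled documents per topic
        self.word_counts = defaultdict(Counter)  # word frequencies per topic
        self.vocabulary = set()                  # every word seen in training

    def train(self, examples):
        """examples: iterable of (text, topic) pairs that are already labelled."""
        for text, topic in examples:
            self.doc_counts[topic] += 1
            for word in tokenize(text):
                self.word_counts[topic][word] += 1
                self.vocabulary.add(word)

    def classify(self, text):
        """Return the topic with the highest log-probability for the text."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        # Words never seen in training carry no information, so skip them.
        words = [w for w in tokenize(text) if w in self.vocabulary]
        best_topic, best_score = None, float("-inf")
        for topic, n_docs in self.doc_counts.items():
            score = math.log(n_docs / total_docs)  # log prior of the topic
            topic_total = sum(self.word_counts[topic].values())
            for word in words:
                count = self.word_counts[topic][word]
                score += math.log((count + 1) / (topic_total + vocab_size))
            if score > best_score:
                best_topic, best_score = topic, score
        return best_topic


# Toy training data, invented purely for this example.
TRAINING_EXAMPLES = [
    ("The central bank raised interest rates and stocks fell", "Finance"),
    ("Investors bought bonds as the stock market dropped", "Finance"),
    ("The new film broke box office records this weekend", "Entertainment"),
    ("The singer announced a concert tour and a new album", "Entertainment"),
    ("The startup released a new smartphone app and software update", "Technology"),
    ("Engineers built a faster processor for the laptop", "Technology"),
]

classifier = NaiveBayesTopicClassifier()
classifier.train(TRAINING_EXAMPLES)

if __name__ == "__main__":
    print(classifier.classify("Stocks and bonds rallied after the interest rate decision"))
```

The toolkits above wrap the same idea (count words per category, then pick the most probable category for new text) together with better tokenisation, feature selection and evaluation tools.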

Good luck!

gromgull
A: 

If you're into Python/interpreted languages, check out the excellent NLTK framework at nltk.org. It has a good how-to page and a recently published O'Reilly book.

If you're into Java and/or require a more mature but harder to grasp framework, try GATE instead.