views: 136
answers: 6

Hi, I would like to know of open source tools (for Java/Python) which could help me extract semantic and stylistic features from text. Examples of semantic features would be the adjective-noun ratio, a particular sequence of part-of-speech tags (an adjective followed by a noun: adj|nn), etc. Examples of stylistic features would be the number of unique words, the number of pronouns, etc. Currently, I know only of Word to Web Tools, which converts a block of text into a rudimentary vector space model.

I am aware of a few text-mining packages like GATE, NLTK, RapidMiner, Mallet and MinorThird. However, I couldn't find a mechanism in any of them that suits my task.

Regards,
--Denzil

+1  A: 

I use Lucene's analyzers and indexing mechanism to build vector spaces for documents and then navigate in that space. You can construct term frequency vectors for documents and use an existing document to search for similar documents in the vector space. If your data is big (millions of documents, tens of thousands of features), then you will probably like Lucene. You can also do stemming, POS tagging and other stuff. This blog post might be a good starting point for POS tagging. In short, Lucene provides all the mechanisms you need to implement the tasks you mentioned.
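
(For illustration only, a minimal plain-Python sketch of the term-frequency/cosine-similarity idea described above. It does not use Lucene's API; the crude tokenisation and the toy documents are assumptions made for the example.)

    import math
    import re
    from collections import Counter

    def term_freq_vector(text):
        # Crude tokenisation: lowercase word characters only.
        return Counter(re.findall(r"\w+", text.lower()))

    def cosine_similarity(v1, v2):
        # Dot product over shared terms, normalised by vector lengths.
        dot = sum(v1[t] * v2[t] for t in set(v1) & set(v2))
        norm1 = math.sqrt(sum(c * c for c in v1.values()))
        norm2 = math.sqrt(sum(c * c for c in v2.values()))
        return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

    docs = ["the quick brown fox", "the lazy brown dog", "pure statistics"]
    vectors = [term_freq_vector(d) for d in docs]
    query = term_freq_vector("a quick brown dog")

    # Rank the documents by similarity to the query document.
    for doc, vec in sorted(zip(docs, vectors),
                           key=lambda pair: cosine_similarity(query, pair[1]),
                           reverse=True):
        print(doc, cosine_similarity(query, vec))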

One library that I hear about frequently is Semantic Vectors. It's also built on Lucene, but I don't have direct experience with it. Beyond that, I suggest looking at Wikipedia's Vector Space Model article.

Amaç Herdağdelen
Amac, thanks for the response! The vector space model, though very robust, is a primitive model that relies mostly on statistics. I would like to implement a more complex model that uses semantic knowledge from the text, such as concepts. The blog post could help me extract a sequence of POS patterns using Lucene; however, a more lightweight package like NLTK (using a regex, of course) can perform the same task. Thanks for pointing me to the Semantic Vectors package. Though it doesn't directly help with my task, I will consider using it for some other tasks.
Denzil
+1  A: 

I used NLTK for some NLP (Natural Language Processing) tasks and it worked really well (albeit kind of slowly). Why exactly do you want such a structured representation of your text? It's a genuine question, because depending on the application, much simpler representations can sometimes work better.
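
(A minimal NLTK sketch of the kinds of features the question mentions: adjective-noun ratio, adj|nn sequences, unique words and pronoun counts. It assumes NLTK is installed with the tokeniser and tagger models downloaded; the example sentence is made up.)

    import nltk  # requires the punkt tokeniser and the default POS tagger data

    text = "The quick brown fox jumps over the lazy dog because it can."
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)  # list of (word, Penn Treebank tag) pairs

    # Stylistic features
    num_unique_words = len(set(w.lower() for w in tokens if w.isalpha()))
    num_pronouns = sum(1 for _, tag in tagged if tag in ("PRP", "PRP$"))

    # Semantic features
    num_adjectives = sum(1 for _, tag in tagged if tag.startswith("JJ"))
    num_nouns = sum(1 for _, tag in tagged if tag.startswith("NN"))
    adj_noun_ratio = num_adjectives / num_nouns if num_nouns else 0.0

    # adj|nn: an adjective immediately followed by a noun
    adj_nn_pairs = [(w1, w2)
                    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
                    if t1.startswith("JJ") and t2.startswith("NN")]

    print(num_unique_words, num_pronouns, adj_noun_ratio, adj_nn_pairs)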

Gabe
Gabe, thanks for the response! I have a list of sentences in a .txt file (29 million sentences, to be precise). Each sentence is annotated with some topic, and there can be multiple annotations per sentence. I have a list of the unique words from the text file and a list of the unique annotations too. I need to create a word (unique term)-annotation matrix, similar to a term-document matrix. However, I am at sea as to how to proceed, considering there are about 15 million unique words and 318k annotations. The size of the data structure puts me off.
Denzil
Well, that *is* pretty big :-D For starters, you probably won't want to read the whole file in at once, and secondly: are you sure you need the whole data structure in memory at a single time to do what you're trying to do? Depending on what you're trying to do, it might not be out of the question to store the data in a database, either a key-value store (CouchDB et al.) or a simple table in a relational database (MySQL).
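
(A rough sketch of that idea, streaming the file one line at a time and keeping sparse (word, annotation) counts in SQLite instead of a dense 15M x 318k matrix. The line format `sentence<TAB>topic1,topic2` and the file name are assumptions made for the example.)

    import sqlite3

    conn = sqlite3.connect("word_annotation.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS counts (
                        word TEXT, annotation TEXT, n INTEGER,
                        PRIMARY KEY (word, annotation))""")

    def bump(word, annotation):
        # Increment an existing cell, or create it if the pair is new.
        cur = conn.execute("UPDATE counts SET n = n + 1 WHERE word = ? AND annotation = ?",
                           (word, annotation))
        if cur.rowcount == 0:
            conn.execute("INSERT INTO counts VALUES (?, ?, 1)", (word, annotation))

    with open("sentences.txt") as f:        # hypothetical input file
        for line in f:                      # stream: one sentence in memory at a time
            sentence, _, topics = line.rstrip("\n").partition("\t")
            annotations = [t for t in topics.split(",") if t]
            for word in set(sentence.lower().split()):   # crude tokenisation
                for annotation in annotations:
                    bump(word, annotation)

    conn.commit()
    conn.close()
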
Gabe
Gabe, I am okay with keeping the features in an RDBMS etc. The main point, though, is extracting them!
Denzil
+2  A: 

If your text is mostly natural language (in English), you can try to extract phrases using a part-of-speech (POS) tagger. Monty tagger is a pure-Python POS tagger. I've had very satisfactory performance from a C++ POS tagger, CRFTagger (http://sourceforge.net/projects/crftagger/), which I tied to Python using subprocess.Popen. The POS tags allow you to keep only the important pieces of a sentence (nouns and verbs, for example), which can then be indexed using any indexing tool such as Lucene or Xapian (my favourite).
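
(A rough sketch of the subprocess.Popen approach, keeping only nouns and verbs. The tagger executable and its word/TAG output format below are assumptions rather than CRFTagger's actual interface; adjust both to whatever tagger you use.)

    from subprocess import Popen, PIPE

    def tag_sentence(sentence, command=("./tagger",)):   # hypothetical tagger executable
        # Send the raw sentence on stdin and read "word/TAG word/TAG ..." from stdout.
        proc = Popen(command, stdin=PIPE, stdout=PIPE)
        out, _ = proc.communicate(sentence.encode("utf-8"))
        return [token.rsplit("/", 1) for token in out.decode("utf-8").split()]

    def content_words(tagged):
        # Keep only nouns (NN*) and verbs (VB*) for indexing.
        return [word for word, tag in tagged
                if tag.startswith("NN") or tag.startswith("VB")]

    tagged = tag_sentence("The quick brown fox jumps over the lazy dog .")
    print(content_words(tagged))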

Ken Pu
+4  A: 

I think the Stanford Parser is one of the best and most comprehensive NLP tools available for free: not only will it let you parse the structural dependencies (to count nouns/adjectives), but it will also give you the grammatical dependencies in the sentence (so you can extract the subject, object, etc.). The latter is something that Python libraries simply cannot do yet (see http://stackoverflow.com/questions/3125926/does-nltk-have-a-tool-for-dependency-parsing) and is probably going to be the most important feature with regard to your software's ability to work with semantics.

If you're interested in Java and Python tools, then Jython is probably the most fun for you to use. I was in the exact same boat, so I wrote a post about using Jython to run the example code provided with the Stanford Parser; give it a glance and see what you think: http://blog.gnucom.cc/2010/using-the-stanford-parser-with-jython/
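
(For reference, a rough Jython sketch along the lines of the parser's bundled ParserDemo from that era. The class names, method names and model path are assumptions that should be checked against the parser version you download, and stanford-parser.jar must be on the classpath.)

    # Jython: run with stanford-parser.jar on the classpath.
    from java.util import Arrays
    from edu.stanford.nlp.parser.lexparser import LexicalizedParser
    from edu.stanford.nlp.trees import PennTreebankLanguagePack

    lp = LexicalizedParser("englishPCFG.ser.gz")          # path to the bundled model
    words = ["The", "quick", "brown", "fox", "jumps", "."]
    parse = lp.apply(Arrays.asList(words))                # phrase-structure tree
    parse.pennPrint()

    # Grammatical (typed) dependencies: subject, object, modifiers, ...
    tlp = PennTreebankLanguagePack()
    gsf = tlp.grammaticalStructureFactory()
    gs = gsf.newGrammaticalStructure(parse)
    print gs.typedDependenciesCollapsed()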

Edit: After reading one of your comments, I learned you need to parse 29 million sentences. I think you could benefit greatly from using pure Java to combine two really powerful technologies: the Stanford Parser and Hadoop. Both are written purely in Java and have extremely rich APIs that you can use to parse vast amounts of data in a fraction of the time on a cluster of machines. If you don't have the machines, you can use Amazon's EC2 cluster. If you need an example of using the Stanford Parser with Hadoop, leave a comment for me and I'll update the post with a URL to my example.

gnucom
+1  A: 

Here's a compilation of Java NLP tools that's reasonably up-to-date: http://www.searchenginecaffe.com/2007/03/java-open-source-text-mining-and.html

LingPipe (http://alias-i.com/lingpipe/) hasn't been mentioned in the answers yet, and is an excellent & actively developed toolkit.

Jon
Jon, LingPipe only goes so far. I have used LingPipe extensively, but what I'm asking for is probably not provided by any of these tools, LingPipe included.
Denzil
A: 

One of the brilliant libraries I got hold of: http://code.google.com/p/textmatrix/

Denzil