I'm working on a project where I need to analyze a page of text, and collections of pages of text, to determine dominant words. I'd like to know if there is a library (preferably C# or Java) that will handle the heavy lifting for me. If not, is there an algorithm or several that would achieve my goals below?

What I want to do is similar to the word clouds built from a URL or RSS feed that you find on the web, except I don't want the visualization. They're used all the time for analyzing presidential candidate speeches to see what the theme or most-used words are.

The complication is that I need to do this on thousands of short documents, and then on collections or categories of those documents.

My initial plan was to parse each document out, then filter out common words - of, the, he, she, etc. Then count the number of times the remaining words show up in the text (and in the overall collection/category).
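That initial plan - tokenize, drop common words, count what's left - is only a few lines. Here's a minimal Java sketch of it (the class name and stop-word list are my own illustration, not from any library):

```java
import java.util.*;

// Sketch of the plan above: lower-case, tokenize, drop stop words, count.
public class TermCounter {
    private static final Set<String> STOP_WORDS =
        Set.of("of", "the", "he", "she", "a", "an", "and", "to", "in", "is");

    public static Map<String, Integer> countTerms(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Split on anything that is not a letter.
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) continue;
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countTerms("The cat sat; the cat slept.")); // cat=2, sat=1, slept=1
    }
}
```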

The problem is that in the future I would like to handle stemming, plural forms, etc. I would also like to see if there is a way to identify important phrases (instead of counting a single word, counting a phrase of 2-3 words together).

Any guidance on a strategy, libraries or algorithms that would help are appreciated.

+7  A: 

One option for what you're doing is term frequency-inverse document frequency, or tf-idf. The strongest terms will have the highest weighting under this calculation. Check it out here: http://en.wikipedia.org/wiki/Tf-idf
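To make the weighting concrete: tf is how often a term appears in one document, and idf penalizes terms that appear in many documents, idf = log(N / df). A minimal, unoptimized Java sketch (class and method names are illustrative):

```java
import java.util.*;

// tf-idf for one term in one document, against a corpus of token lists.
// tf = raw count in the document; idf = log(corpus size / docs containing term).
public class TfIdf {
    public static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        long tf = doc.stream().filter(term::equals).count();
        long df = corpus.stream().filter(d -> d.contains(term)).count();
        if (df == 0) return 0.0;                        // term unseen in corpus
        return tf * Math.log((double) corpus.size() / df);
    }
}
```

A term that appears in every document gets idf = log(1) = 0, which is exactly why stop words fall out of the ranking without an explicit stop list.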

Another option is to use something like a naive Bayes classifier with words as features, and find the strongest features in the text to determine the class of the document. This would work similarly with a maximum entropy classifier.

As far as tools to do this, the best tool to start with would be NLTK, a Python library with extensive documentation and tutorials: http://nltk.sourceforge.net/

For Java, try OpenNLP: http://opennlp.sourceforge.net/

For the phrase stuff, consider the second option I offered up by using bigrams and trigrams as features, or even as terms in tf-idf.
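N-grams are easy to generate yourself once you have a token list; each bigram or trigram can then be counted or weighted exactly like a single word. An illustrative sketch (names are mine):

```java
import java.util.*;

// Slide a window of size n over the token list and join each window
// into a single string, so phrases can be counted like words.
public class Ngrams {
    public static List<String> ngrams(List<String> tokens, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return out;
    }
}
```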

Good luck!

Robert Elwell
+2  A: 

To add to Robert Elwell's answer:

  • stemming and collapsing word forms: a simple method in English is to apply Porter stemming to the lower-cased word forms.
  • a term for the "common words" is "stop word", and a list of them is a "stop list".
  • reading through the NLTK book, as suggested, will explain a lot of these introductory issues well.
  • some of the problems you have to tackle are splitting text into sentences (so that your bigrams and n-gram phrases don't cross sentence boundaries), splitting sentences into tokens, and deciding what to do about possessive forms, for example.
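To illustrate what collapsing word forms buys you, here is a deliberately crude suffix stripper - this is *not* a real Porter stemmer, just a rough approximation I made up for illustration; use a proper implementation (e.g. the one shipped with Lucene or NLTK) in practice:

```java
// Crude plural collapsing, loosely inspired by the first step of the
// Porter algorithm. Real stemmers handle many more suffixes and exceptions.
public class CrudeStemmer {
    public static String stem(String word) {
        String w = word.toLowerCase();
        if (w.endsWith("ies") && w.length() > 4)
            return w.substring(0, w.length() - 3) + "y";   // ponies -> pony
        if (w.endsWith("sses"))
            return w.substring(0, w.length() - 2);          // classes -> class
        if (w.endsWith("s") && !w.endsWith("ss") && w.length() > 3)
            return w.substring(0, w.length() - 1);          // cats -> cat
        return w;                                           // glass -> glass
    }
}
```

The payoff is that "candidate" and "candidates" (or "speech" and "speeches") collapse to one counter instead of splitting their frequency in two.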

None of this stuff is clear cut, nor does any of it have "correct answers". See also the "nlp" and "natural-language" SO tags.

Good luck! This is a non-trivial project.

Gregg Lind
I added the "natural-language" tag to the post.

+1  A: 
yogman
That sounds like quite a nice package. Nice of MS to give it away.
Gregg Lind
+1  A: 

Alrighty. So you've got a document containing text and a collection of documents (a corpus). There are a number of ways to do this.

I would suggest using the Lucene engine (Java) to index your documents. Lucene supports a data structure (Index) that maintains a number of documents in it. A document itself is a data structure that can contain "fields" - say, author, title, text, etc. You can choose which fields are indexed and which ones are not.

Adding documents to an index is trivial. Lucene is also built for speed, and can scale superbly.

Next, you want to figure out the terms and their frequencies. Since Lucene has already calculated these for you during the indexing process, you can either use the docFreq function and build your own term-frequency function, or use the IndexReader class's getTermFreqVectors function to get the terms (and the frequencies thereof).

Now it's up to you how to sort them, and what criteria you want to use to filter the words you want. To figure out relationships between words, you can use a Java API to the WordNet open-source library. To stem words, use Lucene's PorterStemFilter class. The phrase-importance part is trickier, but once you've gotten this far you can search for tips on how to integrate n-gram searching into Lucene (hint).

Good luck!

viksit
A: 

Check out the MapReduce model to get the word counts, then derive the frequencies as described under tf-idf.

Hadoop is Apache's MapReduce framework, and it can be used for the heavy-lifting task of counting words across many documents. http://hadoop.apache.org/common/docs/current/mapred%5Ftutorial.html
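The MapReduce word count is just "map each line to (word, 1) pairs, then reduce by summing per word". A single-machine sketch of the same shape using Java streams (Hadoop distributes exactly this pattern across machines; the class name is mine):

```java
import java.util.*;
import java.util.stream.*;

// Word count in the MapReduce shape: the flatMap is the "map" phase,
// the groupingBy/counting collector is the "reduce" phase.
public class MiniWordCount {
    public static Map<String, Long> wordCount(Stream<String> lines) {
        return lines
            .flatMap(line -> Arrays.stream(line.toLowerCase().split("[^a-z]+"))) // map: emit words
            .filter(w -> !w.isEmpty())
            .collect(Collectors.groupingBy(w -> w, Collectors.counting()));      // reduce: sum per word
    }
}
```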

You won't find a single framework that solves everything you want. You have to choose the right combination of concepts and frameworks to get what you want.

A: 

I would also like to see if there is a way to identify important phrases. (Instead of a count of a word, the count of a phrase being 2-3 words together)

This part of your problem is called collocation extraction (at least if you take "important phrases" to be phrases that appear significantly more often than chance would predict). I gave an answer over at another SO question about that specific subproblem.
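One common score used in collocation extraction is pointwise mutual information: PMI(x, y) = log(P(x, y) / (P(x) P(y))), which is high when two words co-occur far more often than their individual frequencies would predict. A rough, unsmoothed sketch (class and method names are mine; real toolkits like NLTK offer tuned versions):

```java
import java.util.*;

// PMI of the ordered bigram (x, y) over a token stream.
// P(x), P(y) are unigram probabilities; P(x, y) is the adjacent-pair probability.
public class Pmi {
    public static double pmi(List<String> tokens, String x, String y) {
        int n = tokens.size();
        double px = 0, py = 0, pxy = 0;
        for (int i = 0; i < n; i++) {
            if (tokens.get(i).equals(x)) px++;
            if (tokens.get(i).equals(y)) py++;
            if (i + 1 < n && tokens.get(i).equals(x) && tokens.get(i + 1).equals(y)) pxy++;
        }
        if (pxy == 0) return Double.NEGATIVE_INFINITY;  // never co-occur
        return Math.log((pxy / (n - 1)) / ((px / n) * (py / n)));
    }
}
```

In practice you would rank all bigrams by this score (usually with a minimum-count cutoff, since PMI overrates rare pairs) and keep the top ones as candidate phrases.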

Darius Bacon