Without getting a degree in information retrieval, I'd like to know if there exists any algorithms for counting the frequency that words occur in a given body of text. The goal is to get a "general feel" of what people are saying over a set of textual comments. Along the lines of Wordle.

What I'd like:

  • ignore articles, pronouns, etc ('a', 'an', 'the', 'him', 'them' etc)
  • preserve proper nouns
  • ignore hyphenation, except for the soft kind

Reaching for the stars, these would be peachy:

  • handling stemming & plurals (e.g. like, likes, liked, liking match the same result)
  • grouping of adjectives (adverbs, etc) with their subjects ("great service" as opposed to "great", "service")

I've attempted some basic stuff using Wordnet but I'm just tweaking things blindly and hoping it works for my specific data. Something more generic would be great.

A: 

I wonder if this is at all useful to you: http://www.ibm.com/developerworks/linux/library/l-cpnltk.html

Loren Segal
+1  A: 

You just described the algorithm. A program that does it out of the box with a big button saying "Do it"... I don't know.

But let me be constructive. I recommend the book Programming Collective Intelligence. Chapters 3 and 4 contain very pragmatic examples (really, no complex theories, just examples).

graffic
+5  A: 

I did this in .NET:

http://blog.wekeroad.com/2007/09/12/text-mining-and-analysis-with-linqtosql-part-1/ http://blog.wekeroad.com/2007/09/20/text-mining-and-analysis-with-linqtosql-part-2/

Hope this helps

Rob Conery
This is brilliant, thank you! You've given some great examples with great libraries. I was trying to avoid exclusion lists though but you've given a really good case for using them.
C4H5As
great! glad I could help :)
Rob Conery
+1  A: 

I wrote a full program to do just this a while back. I can upload a demo later when I get home.

Here is the code (ASP.NET/C#): http://naspinski.net/post/Findingcounting-Keywords-out-of-a-Text-Document.aspx

naspinski
I like this - it's nice, simple and fast. It's similar to my current implementation (minus wordnet) but my goal was to scale up to something a little more intelligent.
C4H5As
+20  A: 

You'll need not one, but several nice algorithms, along the lines of the following.

  • ignoring pronouns is done via a stoplist.
  • preserving proper nouns? You mean detecting named entities, like Hoover Dam, and saying "it's one word," or compound nouns, like programming language? I'll give you a hint: that's a tough one, but there exist libraries for both. Look for NER (named entity recognition) and lexical chunking. OpenNLP is a Java toolkit that does both.
  • ignoring hyphenation? You mean, like at line breaks? Use regular expressions and verify the resulting word via dictionary lookup.
  • handling plurals/stemming: you can look into the Snowball stemmer. It does the trick nicely.
  • "grouping" adjectives with their nouns is generally a task of shallow parsing. But if you are looking specifically for qualitative adjectives (good, bad, shitty, amazing...) you may be interested in sentiment analysis. LingPipe does this, and a lot more.
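A minimal sketch of the first and third points (stoplist, and rejoining words hyphenated at line breaks with a dictionary check), assuming plain English text. The `STOPLIST` and `VOCABULARY` sets and both function names are made up for illustration; a real setup would load much larger word lists.

```python
import re

# Hypothetical, tiny word lists for illustration only.
STOPLIST = {"a", "an", "the", "him", "them", "was", "and"}
VOCABULARY = {"service", "great", "friendly", "staff"}

def rejoin_line_break_hyphens(text, vocabulary):
    """Join 'ser-<newline>vice' into 'service' if the joined word is known."""
    def join(match):
        candidate = match.group(1) + match.group(2)
        if candidate.lower() in vocabulary:
            return candidate
        # Unknown when joined: treat it as a real hyphen, drop only the break.
        return match.group(1) + "-" + match.group(2)
    return re.sub(r"(\w+)-\s*\n\s*(\w+)", join, text)

def content_words(text, stoplist):
    """Split into words (keeping inner hyphens) and drop stoplist entries."""
    words = re.findall(r"[A-Za-z']+(?:-[A-Za-z']+)*", text)
    return [w for w in words if w.lower() not in stoplist]

text = "The ser-\nvice was great and the staff was friendly."
cleaned = rejoin_line_break_hyphens(text, VOCABULARY)
print(content_words(cleaned, STOPLIST))
```

The stoplist comparison is case-insensitive, but the words themselves keep their original capitalization, so proper nouns are not flattened at this stage.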

I'm sorry, I know you said you wanted to KISS, but unfortunately, your demands aren't that easy to meet. Nevertheless, there exist tools for all of this, and you should be able to just tie them together and not have to perform any task yourself, if you don't want to. If you want to perform a task yourself, I suggest you look at stemming, it's the easiest of all.

If you go with Java, combine Lucene with the OpenNLP toolkit. You will get very good results, as Lucene already has a stemmer built in and a lot of tutorials. The OpenNLP toolkit, on the other hand, is poorly documented, but you won't need too much out of it. You might also be interested in NLTK, written in Python.

I would say you should drop your last requirement, as it involves shallow parsing and will definitely not improve your results.

Ah, by the way: the exact term for that document-term-frequency thing you were looking for is tf-idf. It's pretty much the best way to look at document frequency for terms. To do it properly, you won't get around using multidimensional vector matrices.
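For a rough feel of what tf-idf computes, here is a small sketch using plain dicts instead of matrices. It uses one common variant (term count divided by document length, times log(N / document frequency)); other variants smooth or scale differently. The function name `tf_idf` and the sample comments are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    n_docs = len(tokenized)
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

comments = [
    "great service great food",
    "slow service",
    "great location",
]
for scores in tf_idf(comments):
    # Show the two highest-weighted terms of each comment.
    print(sorted(scores.items(), key=lambda kv: -kv[1])[:2])
```

A term that occurs in every document gets weight 0 under this variant, which is what pushes words like "the" down without needing a stoplist.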

... Yes, I know. After taking a seminar on IR, my respect for Google was even greater. After doing some stuff in IR, my respect for them fell just as quickly, though.

Aleksandar Dimitrov
I enjoyed reading all of this stuff. I'm saving a link to this comment. Thanks!
Justin Bozonier
Thanks, this is a really descriptive answer that's hit all the right notes.
C4H5As
wow, great answer, just what I need
andy
+1  A: 

The first part of your question doesn't sound so bad. All you basically need to do is read each word from the file (or stream w/e) and place it into a prefix tree and each time you happen upon a word that already exists you increment the value associated with it. Of course you would have an ignore list of everything you'd like left out of your calculations as well.

If you use a prefix tree, finding any word is going to be O(N), where N is the maximum length of a word in your data set. The advantage of a prefix tree in this situation is that if you want to look for plurals and stemming, you can check in O(M+1) whether that's even possible for the word, where M is the length of the word without stem or plurality (is that a word? hehe). Once you've built your prefix tree, I would reanalyze it for the stems and such and condense it down so that the root word is what holds the results.

Upon searching you could have some simple rules in place to have the match return positive in case of the root or stem or what have you.
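A rough sketch of that idea, assuming words are already split out and lowercased. The class names `TrieNode` and `WordTrie` are made up, and the plural rule is deliberately naive (a trailing "s" only), so it is one step past the word rather than real stemming.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # next character -> TrieNode
        self.count = 0      # how many times a word ending here was seen


class WordTrie:
    def __init__(self, ignore=()):
        self.root = TrieNode()
        self.ignore = set(ignore)  # the "ignore list" of words to leave out

    def add(self, word):
        if word in self.ignore:
            return
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def _find(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def count(self, word):
        node = self._find(word)
        return node.count if node is not None else 0

    def count_with_plural(self, word):
        """Count the word plus its naive 's' plural, one step past the word."""
        node = self._find(word)
        if node is None:
            return 0
        plural = node.children.get("s")
        return node.count + (plural.count if plural is not None else 0)


trie = WordTrie(ignore={"the", "a"})
for w in "the word and the words a word".split():
    trie.add(w)
print(trie.count("word"), trie.count_with_plural("word"))
```

For plain counting, a hash table is usually simpler; the trie only pays off if you want these prefix-based checks.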

The second part seems extremely challenging. My naive inclination would be to hold separate results for adjective-subject groupings. Use the same principles as above but just keep it separate.

Another option for the semantic analysis could be modeling each sentence as a tree of subject, verb, etc relationships (Sentence has a subject and verb, subject has a noun and adjective, etc). Once you've broken all of your text up in this way it seems like it might be fairly easy to run through and get a quick count of the different appropriate pairings that occurred.

Just some ramblings, I'm sure there are better ideas, but I love thinking about this stuff.

Justin Bozonier
+8  A: 

Welcome to the world of NLP ^_^

All you need is a little basic knowledge and some tools.

There are already tools that will tell you if a word in a sentence is a noun, adjective or verb. They are called part-of-speech taggers. Typically, they take plaintext English as input, and output the word, its base form, and the part-of-speech. Here is the output of a popular UNIX part-of-speech tagger on the first sentence of your post:

$ echo "Without getting a degree in information retrieval, I'd like to know if there exists any algorithms for counting the frequency that words occur in a given body of text." | tree-tagger-english 
# Word  POS  base form
Without IN without
getting VVG get
a   DT a
degree  NN degree
in  IN in
information NN information
retrieval   NN retrieval
,   , ,
I   PP I
'd  MD will
like    VV like
to  TO to
know    VV know
if  IN if
there   EX there
exists  VVZ exist
any DT any
algorithms  NNS algorithm
for IN for
counting    VVG count
the DT the
frequency   NN frequency
that    IN/that that
words   NNS word
occur   VVP occur
in  IN in
a   DT a
given   VVN give
body    NN body
of  IN of
text    NN text
.   SENT .

As you can see, it identified "algorithms" as being the plural form (NNS) of "algorithm" and "exists" as being a conjugation (VVZ) of "exist." It also identified "a" and "the" as determiners (DT), another word for articles. The POS tagger also tokenized the punctuation.

To do everything but the last point on your list, you just need to run the text through a POS tagger, filter out the categories that don't interest you (determiners, pronouns, etc.) and count the frequencies of the base forms of the words.
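The filter-and-count step can be sketched like this, assuming the tagger output has already been captured as text with three whitespace-separated columns (word, tag, base form) as in the listing above. The `SKIP_TAGS` set is a guess built from the tags in that sample, and `count_base_forms` is a made-up name; a different tagger or tag set would need a different list.

```python
from collections import Counter

# Tags to drop: determiners, pronouns, prepositions, modals, punctuation, etc.
# Taken from the sample output; not a complete list for any tag set.
SKIP_TAGS = {"DT", "PP", "IN", "IN/that", "TO", "MD", "EX", ",", "SENT"}

def count_base_forms(tagger_output, skip_tags=SKIP_TAGS):
    """Count base forms of all words whose tag is not in skip_tags."""
    counts = Counter()
    for line in tagger_output.splitlines():
        fields = line.split()
        if len(fields) != 3 or line.startswith("#"):
            continue  # skip blank lines, comments, and malformed rows
        word, tag, base = fields
        if tag not in skip_tags:
            counts[base] += 1
    return counts

sample = """\
there   EX there
exists  VVZ exist
any DT any
algorithms  NNS algorithm
for IN for
counting    VVG count
the DT the
frequency   NN frequency
that    IN/that that
words   NNS word
"""
print(count_base_forms(sample).most_common())
```

Counting the third column rather than the first is what merges "word" and "words", or "like", "likes" and "liked", into one entry.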

Here are some popular POS taggers:

TreeTagger (binary only: Linux, Solaris, OS-X)
GENIA Tagger (C++: compile yourself)
Stanford POS Tagger (Java)

To do the last thing on your list, you need more than just word-level information. An easy way to start is by counting sequences of words rather than just words themselves. These are called n-grams. A good place to start is UNIX for Poets. If you are willing to invest in a book on NLP, I would recommend Foundations of Statistical Natural Language Processing.
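Counting n-grams takes only a few lines. This sketch assumes the text is already tokenized and lowercased; the `ngrams` helper and the sample sentence are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))

tokens = "great service and great food but not great service today".split()
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(1))
# prints [(('great', 'service'), 2)]
```

Combined with a POS tagger, you could keep only the bigrams tagged adjective followed by noun, which gets close to "great service" as a single unit.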

underspecified
Nice explanation, deserved an upvote. Just one thing, though: The Stanford Tagger is a textbook example of what I call "Akademikerware" in German. It's horrible to program with, since the API will give you nothing but headaches. For Java, I prefer OpenNLP and LingPipe. For Python, NLTK.
Aleksandar Dimitrov
Thanks for the upvote :-) I will have to remember "Akademikerware" -- that describes 95% of NLP software IMO. I haven't checked out OpenNLP or LingPipe. I'm not actually a Java guy.
underspecified
+2  A: 

Here is an example of how you might do that in Python; the concepts are similar in any language.

>>> import urllib2, string
>>> devilsdict = urllib2.urlopen('http://www.gutenberg.org/files/972/972.txt').read()
>>> workinglist = devilsdict.split()
>>> cleanlist = [item.strip(string.punctuation) for item in workinglist]
>>> results = {}
>>> skip = {'a':'', 'the':'', 'an':''}
>>> for item in cleanlist:
      if item not in skip:
        try:
          results[item] += 1
        except KeyError:
          results[item] = 1

>>> results
{'': 17, 'writings': 3, 'foul': 1, 'Sugar': 1, 'four': 8, 'Does': 1, "friend's": 1, 'hanging': 4, 'Until': 1, 'marching': 2 ...

The first line just imports libraries that help with parts of the problem, as in the second line, where urllib2 downloads a copy of Ambrose Bierce's "Devil's Dictionary." The next lines make a list of all the words in the text, without punctuation. Then you create a hash table, which in this case is like a list of unique words, each associated with a number. The for loop goes over each word in the Bierce book: if there is already a record of that word in the table, each new occurrence adds one to the value associated with it; if the word hasn't appeared yet, it gets added to the table with a value of 1 (meaning one occurrence). For the cases you are talking about, you would want to pay much more attention to detail, for example using capitalization to help identify proper nouns only in the middle of sentences. This is very rough, but it expresses the concept.

To get into the stemming and pluralization stuff, experiment, then look into third-party work. I have enjoyed parts of NLTK, which is an academic open source project, also in Python.
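The session above is Python 2 (urllib2). As a sketch of the same counting loop in Python 3, collections.Counter replaces the try/except bookkeeping. The `word_counts` name and the sample sentence are made up, and the text is passed in as a string rather than downloaded.

```python
import string
from collections import Counter

def word_counts(text, skip=frozenset({"a", "an", "the"})):
    """Count words after stripping surrounding punctuation."""
    words = (token.strip(string.punctuation) for token in text.split())
    # Drop tokens that were pure punctuation (they strip down to '')
    # and anything in the skip list, compared case-insensitively.
    return Counter(w for w in words if w and w.lower() not in skip)

sample = "The service was great. The staff, a great team, made the visit great!"
print(word_counts(sample).most_common(1))
# prints [('great', 3)]
```

Unlike the loop above, this drops empty strings (the '' entry in the output) and skips "The" as well as "the".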

bvmou
A: 

How can I use WordNet for natural language processing? My requirements are just to answer the 5 W's, i.e. What, When, Who, Where and Why...

A: 

You can use the WordNet dictionary to get basic information about the question keyword, such as its part of speech, and to extract synonyms. You can do the same for your document to create an index for it. Then you can easily match the keyword against the index file and rank the document, then summarize it.

tafseer