views: 259
answers: 6

I have a big collection of human-generated content. I want to find the words or phrases that occur most often. What is an efficient way to do this?

+4  A: 

Don't reinvent the wheel. Use a full-text search engine such as Lucene.

hobodave
I do love lucene, but I think it's a bit overkill here. Counting words is simple enough using a hash table.
Emil H
How do you know the solution is overkill? Do you have a better understanding of "big collection of human generated content" than I?
hobodave
I don't want to start a word war here. It's quite possible that you're right. One reason this might be a good solution is that Lucene includes some excellent tokenizers that could be of good use here. On the other hand: 1. Lucene is a rather large dependency that does, and is optimized to do, a lot more than what's asked for. 2. Hash tables are available in more or less every language on the planet, while Lucene is limited to a few platforms.
Emil H
No war here. +1 for substantiating your argument.
hobodave
+3  A: 

The simple/naive way is to use a hash table. Walk through the words and increment each word's count as you go.

At the end of the process sort the key/value pairs by count.
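
A minimal sketch of this approach in Python, assuming words is an iterable of already-tokenized words (the name is just for illustration):

def count_words(words):
    # Plain dict as the hash table: word -> occurrence count.
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    # Sort the key/value pairs by count, most frequent first.
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

# count_words("the cat sat on the mat the end".split())
# -> [('the', 3), ('cat', 1), ('sat', 1), ('on', 1), ('mat', 1), ('end', 1)]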

Sam Saffron
+2  A: 

The basic idea is simple -- in executable pseudocode,

from collections import defaultdict

def process(words):
    # Map each word to its number of occurrences.
    d = defaultdict(int)
    for w in words:
        d[w] += 1
    return d

Of course, the devil is in the details -- how do you turn the big collection into an iterator yielding words? Is it big enough that you can't process it on a single machine, but rather need a mapreduce approach, e.g. via Hadoop? Etc, etc. NLTK can help with the linguistic aspects (isolating words in languages that don't separate them cleanly).
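
One possible sketch of that iterator, assuming the collection is a set of plain-text files and a simple regex tokenizer is good enough (NLTK's tokenizers would handle messier text better); the function and file names are illustrative:

import re

def words_from_files(paths):
    # Lazily yield lowercased word tokens from text files, so the whole
    # collection never has to sit in memory at once.
    word_re = re.compile(r"[a-z']+")
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                for match in word_re.finditer(line.lower()):
                    yield match.group()

# counts = process(words_from_files(["doc1.txt", "doc2.txt"]))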

On a single-machine execution (net of mapreduce), one issue that can arise is that the simple idea gives you far too many singletons or thereabouts (words occurring once or just a few times), which fill memory. A probabilistic retort to that is to do two passes: one with random sampling (get only one word in ten, or one in a hundred) to make a set of words that are candidates for the top ranks, then a second pass skipping words that are not in the candidate set. Depending on how many words you're sampling and how many you want in the result, it's possible to compute an upper bound on the probability that you're going to miss an important word this way (and for reasonable numbers, and any natural language, I assure you that you'll be just fine).
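
A rough sketch of that two-pass idea, assuming the word stream can be regenerated for the second pass (make_words is a callable returning a fresh iterator; the names and sampling rate are illustrative):

import random
from collections import defaultdict

def two_pass_counts(make_words, sample_rate=0.01, seed=42):
    # Pass 1: sample roughly one word in a hundred to build a candidate set;
    # frequent words are very likely to be sampled at least once.
    rng = random.Random(seed)
    candidates = {w for w in make_words() if rng.random() < sample_rate}
    # Pass 2: count only the candidates, so rare singletons never fill the dict.
    counts = defaultdict(int)
    for w in make_words():
        if w in candidates:
            counts[w] += 1
    return counts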

Once you have your dictionary mapping words to numbers of occurrences, you just need to pick the top N words by occurrences -- a heap-queue will help there if the dictionary is too large to sort by occurrences in its entirety (in my favorite executable pseudocode, heapq.nlargest will do it, for example).
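
For instance, assuming counts is the dictionary returned by process above:

import heapq

def top_n(counts, n=50):
    # The n most frequent (word, count) pairs, without sorting the whole dict.
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])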

Alex Martelli
+1  A: 

Look into the Apriori algorithm. It can be used to find frequent items and/or frequent sets of items.

As the Wikipedia article states, there are more efficient algorithms that do the same thing, but this could be a good start to see whether it applies to your situation.
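
A minimal sketch of the Apriori pruning idea, restricted to word pairs and treating, say, each sentence's word set as one "transaction"; the names and the support threshold below are illustrative:

from collections import defaultdict
from itertools import combinations

def frequent_pairs(transactions, min_support=2):
    # Apriori's key observation, down-scoped to pairs: a pair can only be
    # frequent if each of its words is frequent on its own.
    item_counts = defaultdict(int)
    for t in transactions:
        for w in set(t):
            item_counts[w] += 1
    frequent_items = {w for w, c in item_counts.items() if c >= min_support}

    pair_counts = defaultdict(int)
    for t in transactions:
        kept = sorted(set(t) & frequent_items)
        for pair in combinations(kept, 2):
            pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= min_support}

# frequent_pairs([{"big", "data"}, {"big", "data", "tools"}, {"big", "ideas"}])
# -> {('big', 'data'): 2}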

scottman666
A: 

Why not a simple map with the word as the key and a counter as the value? It will give you the most-used words by taking the highest counter values. Counting is just an O(N) operation.
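
In Python, for instance, collections.Counter is essentially this map of counters, and its most_common method picks out the top entries:

from collections import Counter

words = "the cat sat on the mat and the dog".split()
counts = Counter(words)        # word -> occurrence count
print(counts.most_common(3))   # [('the', 3), ('cat', 1), ('sat', 1)]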

Roopesh Majeti
This answer was already given. Twice.
hobodave