views:

1302

answers:

8

The question can be described as: Input: A positive integer K and a big text. The text can actually be viewed as word sequence. So we don't have to worry about how to break down it into word sequence. Output: The most frequent K words in the text.

My thinking is like this.

1) use a Hash table to record all words' frequency while traverse the whole word sequence. In this phase, the key is "word" and the value is "word-frequency". This takes O(n) time.

2) sore the (word, word-frequency) pair; and the key is "word-frequency". This takes O(n*lg(n)) time with normal sorting algorithm.

3) After sorting, we just take the first K words. This takes O(K) time.

To summarize, the total time is O(n+n*lg(n)+K), Since K is surely smaller than N, so it is actually O(n*lg(n)).

We can improve this. Actually, we just want top K words. Other words' frequency is not concern for us. So, we can use "partial Heap sorting". For step 2) and 3), we don't just do sorting. Instead, we change it to be

2') build a heap of (word, word-frequency) pair with "word-frequency" as key. It takes O(n) time to build a heap;

3') extract top K words from the heap. Each extraction is O(lg(n)). So, total time is O(k*lg(n)).

To summarize, this solution cost time O(n+k*lg(n)).

This is just my thought. I haven't find out way to improve step 1). I Hope some Information Retrieval experts can shed more light on this question. Thanks.

+4  A: 

If your "big word list" is big enough, you can simply sample and get estimates. Otherwise, I like hash aggregation.

Edit:

By sample I mean choose some subset of pages and calculate the most frequent word in those pages. Provided you select the pages in a reasonable way and select a statistically significant sample, your estimates of the most frequent words should be reasonable.

This approach is really only reasonable if you have so much data that processing it all is just kind of silly. If you only have a few megs, you should be able to tear through the data and calculate an exact answer without breaking a sweat rather than bothering to calculate an estimate.

Aaron Maenpaa
Sometimes you have to do this many times over, for example if you're trying to get the list of frequent words per website, or per subject. In that case, "without breaking a sweat" doesn't really cut it. You still need to find a way to do it in as efficiently as possible.
itsadok
A: 

Hi Zach, What do you mean by "sample and get estimate"?

Morgan Cheng
He is referring to a statistical method - you take a selection of entries and use that as an estimate of the true frequency
1800 INFORMATION
+3  A: 

You can cut down the time further by partitioning using the first letter of words, then partitioning the largest multi-word set using the next character until you have k single-word sets. You would use a sortof 256-way tree with lists of partial/complete words at the leafs. You would need to be very careful to not cause string copies everywhere.

This algorithm is O(m), where m is the number of characters. It avoids that dependence on k, which is very nice for large k [by the way your posted running time is wrong, it should be O(n*lg(k)), and I'm not sure what that is in terms of m].

If you run both algorithms side by side you will get what I'm pretty sure is an asymptotically optimal O(min(m, n*lg(k))) algorithm, but mine should be faster on average because it doesn't involve hashing or sorting.

What you're describing is called a 'trie'.
Nick Johnson
Hi Strilanc.Can you explain the process of partition in details?
Morgan Cheng
A: 

Hi Strilanc, Thanks for your comments. You really bring new light to this question.

Morgan Cheng
you should post this as a comment, not an answer.
Brann
+1  A: 

You're not going to get generally better runtime than the solution you've described. You have to do at least O(n) work to evaluate all the words, and then O(k) extra work to find the top k terms.

If your problem set is really big, you can use a distributed solution such as map/reduce. Have n map workers count frequencies on 1/nth of the text each, and for each word, send it to one of m reducer workers calculated based on the hash of the word. The reducers then sum the counts. Merge sort over the reducers' outputs will give you the most popular words in order of popularity.

Nick Johnson
A: 

Hi Strilanc,

Suppose we have a word sequence "ad" "ad" "boy" "big" "bad" "com" "come" "cold". And K=2. as you mentioned "partitioning using the first letter of words", we got ("ad", "ad") ("boy", "big", "bad") ("com" "come" "cold") "then partitioning the largest multi-word set using the next character until you have k single-word sets." it will partition ("boy", "big", "bad") ("com" "come" "cold"), the first partition ("ad", "ad") is missed, while "ad" is actually the most frequent word.

Perhaps I misunderstand your point. Can you please detail your process about partition?

Morgan Cheng
A: 

You have a bug in your description: Counting takes O(n) time, but sorting takes O(m*lg(m)), where m is the number of unique words. This is usually much smaller than the total number of words, so probably should just optimize how the hash is built.

martinus
A: 

A small variation on your solution yields an O(n) algorithm if we don't care about ranking the top K, and a O(n+k*lg(k)) solution if we do. I believe both of these bounds are optimal within a constant factor.

The optimization here comes again after we run through the list, inserting into the hash table. We can use the median of medians algorithm to select the Kth largest element in the list. This algorithm is provably O(n).

After selecting the Kth smallest element, we partition the list around that element just as in quicksort. This is obviously also O(n). Anything on the "left" side of the pivot is in our group of K elements, so we're done (we can simply throw away everything else as we go along).

So this strategy is:

  1. Go through each word and insert it into a hash table: O(n)
  2. Select the Kth smallest element: O(n)
  3. Partition around that element: O(n)

If you want to rank the K elements, simply sort them with any efficient comparison sort in O(k * lg(k)) time, yielding a total run time of O(n+k * lg(k)).

The O(n) time bound is optimal within a constant factor because we must examine each word at least once.

The O(n + k * lg(k)) time bound is also optimal because there is no comparison-based way to sort k elements in less than k * lg(k) time.