views: 1953

answers: 9
I'm working on a large project; I won't bother to summarize it here, but this part of the project takes a very large text document (a minimum of around 50,000 words, not all unique) and outputs each unique word in order from most used to least used (the top three will probably be "a", "an", and "the").

My question is, of course: what would be the best sorting algorithm to use? I was reading about counting sort, and I like it, but my concern is that the range of values will be too large compared to the number of unique words.

Any suggestions?

+1  A: 

I'd start with a quicksort and go from there.

Check out the wiki page on sorting algorithms, though, to learn the differences.

Eric
+1 for the link. All programmers need at least a basic understanding of sorting algorithms.
Matthew Vines
+1  A: 

Have a look at the link: a pictorial representation of how the different algorithms work. This should give you a hint!

Sorting Algorithms

aJ
Awesome link, thanks!
Michael Haren
I like this one better http://vision.bc.edu/~dmartin/teaching/sorting/anim-html/all.html
Tom Leys
Both of those seem to suggest Shell sort is best.
aterimperator
+2  A: 

You should try an MSD radix sort. It will sort your entries in lexicographical order. Here is a google code project you might be interested in.
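
As a rough illustration of the idea (not the linked project), here is a sketch of an MSD radix sort over plain ASCII strings in C++; the bucketing scheme and function names are my own:

    #include <string>
    #include <vector>

    // Byte at position d, or -1 if the string has already ended.
    static int char_at(const std::string& s, size_t d) {
        return d < s.size() ? static_cast<unsigned char>(s[d]) : -1;
    }

    // Sorts words lexicographically by recursively bucketing on the d-th byte.
    static void msd_sort(std::vector<std::string>& words, size_t d) {
        if (words.size() <= 1) return;

        // buckets[0] holds strings that end before position d; buckets[1..256]
        // hold strings keyed by the byte value at position d.
        std::vector<std::vector<std::string>> buckets(257);
        for (auto& w : words)
            buckets[char_at(w, d) + 1].push_back(std::move(w));

        words.clear();
        for (size_t b = 0; b < buckets.size(); ++b) {
            if (b > 0) msd_sort(buckets[b], d + 1);  // recurse on the next byte
            for (auto& w : buckets[b]) words.push_back(std::move(w));
        }
    }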

JP Alioto
+9  A: 

First, you will need a map of word -> count. 50,000 words is not much - it will easily fit in memory, so there's nothing to worry about. In C++ you can use std::map from the standard library.

Then, once you have the map, you can copy all the map keys to a vector.

Then, sort this vector using a custom comparison operator: instead of comparing the words, compare the counts from the map. (Don't worry about the specific sorting algorithm - your array is not that large, so any standard library sort will work for you.)
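
A minimal sketch of this approach, assuming whitespace-separated words on standard input and using a C++11 lambda for the comparison (tokenization and punctuation handling are left out):

    #include <algorithm>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    int main() {
        // Step 1: build the word -> count map.
        std::map<std::string, int> counts;
        std::string word;
        while (std::cin >> word)
            ++counts[word];

        // Step 2: copy the map keys into a vector.
        std::vector<std::string> words;
        for (const auto& kv : counts)
            words.push_back(kv.first);

        // Step 3: sort the vector by count (descending), not by the words.
        std::sort(words.begin(), words.end(),
                  [&counts](const std::string& a, const std::string& b) {
                      return counts.at(a) > counts.at(b);
                  });

        for (const auto& w : words)
            std::cout << counts.at(w) << "\t" << w << "\n";
    }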

Igor Krivokon
+1 - 50,000 is not a very large document.
Eclipse
Even 500,000 words is easily manageable.
Niki Yoshiuchi
+1  A: 

This is a bit tricky because you want a map of words -> frequency, and you want to sort by the value rather than by the key, which is the more common case. There's a Java example here that shows how to do it using a custom comparator.

The particular algorithm you use isn't going to have much effect, so I'd just stick with your standard library implementation.
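
The linked example is in Java; as a rough C++ equivalent of the same idea, you can copy the (word, count) pairs out of the map and sort them by the count field (the function name is illustrative):

    #include <algorithm>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    // Copies the map's entries and orders them by frequency, highest first.
    std::vector<std::pair<std::string, int>>
    by_frequency(const std::map<std::string, int>& counts) {
        std::vector<std::pair<std::string, int>> entries(counts.begin(), counts.end());
        std::sort(entries.begin(), entries.end(),
                  [](const std::pair<std::string, int>& a,
                     const std::pair<std::string, int>& b) {
                      return a.second > b.second;
                  });
        return entries;
    }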

Bill the Lizard
+1  A: 

You can get better performance than quicksort on this particular problem, assuming that when two words occur the same number of times, it doesn't matter in which order you output them.

First step: Create a hash map with the words as keys and frequencies as the associated values. You will fill this hash map in as you parse the file. While you are doing this, make sure to keep track of the highest frequency encountered. This step is O(n) complexity.

Second step: Create a list with a number of entries equal to the highest frequency from the first step. Each slot in this list holds a list of the words whose frequency count equals that slot's index, so words that occur 3 times in the document go in list[3], for example. Iterate through the hash map and insert the words into the appropriate spots in the list. This step is O(n) complexity.

Third step: Iterate through the list in reverse and output all the words. This step is O(n) complexity.

Overall this algorithm will accomplish your task in O(n) time rather than the O(n log n) required by quicksort.
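
A rough sketch of the three steps in C++, assuming the document has already been tokenized into a vector of strings (the names are illustrative only):

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <vector>

    void print_by_frequency(const std::vector<std::string>& tokens) {
        // Step 1: hash map of word -> frequency, tracking the highest frequency seen.
        std::unordered_map<std::string, int> freq;
        int max_freq = 0;
        for (const auto& t : tokens)
            max_freq = std::max(max_freq, ++freq[t]);

        // Step 2: buckets[f] collects every word that occurs exactly f times.
        std::vector<std::vector<std::string>> buckets(max_freq + 1);
        for (const auto& kv : freq)
            buckets[kv.second].push_back(kv.first);

        // Step 3: walk the buckets from the highest frequency down to 1.
        for (int f = max_freq; f >= 1; --f)
            for (const auto& w : buckets[f])
                std::cout << f << "\t" << w << "\n";
    }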

MahlerFive
First step is O(n*m), where n is the number of words in the input and m is the number of unique words. The second step uses O(m) memory and does so with a random access pattern - awful for cache. If the third step were used to feed into another piece of code, it would need O(n) memory allocated. All this means your code would have poor memory performance, which will dominate any apparent code improvements.
Tom Leys
If you used a really efficient hash the first step might only be O(m), if you are very lucky and there are no hash collisions.
Tom Leys
A: 

I think you want to do something like what is explained in the post below:

http://karephul.blogspot.com/2008/12/groovy-closures.html

Languages that support closures make the solution much easier, like LINQ, as Eric mentioned.

A: 

In almost every case I've ever tested, Quicksort worked the best for me. However, I did have two cases where Combsort was the best. It could have been that Combsort was better in those cases because the code was so small, or due to some quirk in how ordered the data was.

Any time sorting shows up in my profile, I try the major sorts. I've never had anything that topped both Quicksort and Combsort.

Nosredna
A: 

For large sets you can use what is known as "sort-based indexing" in information retrieval, but for 50,000 words you can do the following (see the sketch after the list):

  • read the entire file into a buffer.
  • parse the buffer and build a token vector of struct token { char *term; int termlen; }, where term is a pointer to the word in the buffer.
  • sort the table by term (lexicographical order).
  • set entrynum = 0 and iterate through the term vector; when a term is new, store it in a vector of struct { char *term; int frequency; } at index entrynum, set the frequency to 1, and increment entrynum; otherwise increment the existing entry's frequency.
  • sort the vector by frequency in descending order.
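
A loose sketch of these steps in C++, simplified to std::string instead of char pointers into a buffer (the names are illustrative):

    #include <algorithm>
    #include <string>
    #include <vector>

    struct Entry { std::string term; int frequency; };

    std::vector<Entry> frequency_table(std::vector<std::string> tokens) {
        // Sort the tokens lexicographically so duplicates become adjacent.
        std::sort(tokens.begin(), tokens.end());

        // Collapse each run of identical tokens into one (term, frequency) entry.
        std::vector<Entry> entries;
        for (const auto& t : tokens) {
            if (entries.empty() || entries.back().term != t)
                entries.push_back({t, 1});
            else
                ++entries.back().frequency;
        }

        // Sort the entries by frequency in descending order.
        std::sort(entries.begin(), entries.end(),
                  [](const Entry& a, const Entry& b) { return a.frequency > b.frequency; });
        return entries;
    }
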
bill