Most efficient sorting algorithm for a large set of numbers

views:

1953

answers:

+2 Q:

Most efficient sorting algorithm for a large set of numbers

I'm working on a large project, I won't bother to summarize it here, but this section of the project is to take a very large document of text (minimum of around 50,000 words (not unique)), and output each unique word in order of most used to least used (probably top three will be "a" "an" and "the").

My question is of course, what would be the best sorting algorithm to use? I was reading of counting sort, and I like it, but my concern is that the range of values will be too large compared to the number of unique words.

Any suggestions?

+1 A:

I'd start with a quicksort and go from there.

Check out the wiki page on sorting algorithms, though, to learn the differences.

Eric 2009-06-05 03:41:40

+1 for the link. All programmers need at least a basic understanding in sorting algorithms.

Matthew Vines 2009-06-05 03:44:08

+1 A:

Have a look at the link. A Pictorial representation on how different algorithm works. This will give you an hint!

Sorting Algorithms

aJ 2009-06-05 03:49:00

Awesome link, thanks!

Michael Haren 2009-06-05 04:26:55

I like this one better http://vision.bc.edu/~dmartin/teaching/sorting/anim-html/all.html

Tom Leys 2009-06-05 04:34:07

Both of those seem to suggest shell is best.

aterimperator 2009-06-07 03:22:29

+2 A:

You should try an MSD radix sort. It will sort your entries in lexicographical order. Here is a google code project you might be interested in.

JP Alioto 2009-06-05 03:50:58

+9 A:

First, you will need a map of word -> count. 50,000 words is not much - it will easily fit in memory, so there's nothing to worry. In C++ you can use the standard STL std::map.

Then, once you have the map, you can copy all the map keys to a vector.

Then, sort this vector using a custom comparison operator: instead of comparing the words, compare the counts from the map. (Don't worry about the specific sorting algorithm - your array is not that large, so any standard library sort will work for you.)

Igor Krivokon 2009-06-05 03:58:54

+1 - 50,000 is not a very large document.

Eclipse 2009-06-05 04:27:43

Even 500,000 words is easily manageable.

Niki Yoshiuchi 2009-06-05 20:15:21

+1 A:

This is a bit tricky because you want a map of words -> frequency, and you want to sort by the value rather than the key (which is common). There's a Java example here that shows how to do it using a custom comparator.

The particular algorithm you use isn't going to have much effect, so I'd just stick with your standard library implementation.

Bill the Lizard 2009-06-05 04:06:02

+1 A:

You can get better performance than quicksort with this particular problem assuming that if two words occur the same number of times, then it doesn't matter in which order you output them.

First step: Create a hash map with the words as key values and frequency as the associated values. You will fill this hash map in as you parse the file. While you are doing this, make sure to keep track of the highest frequency encountered. This step is O(n) complexity.

Second step: Create a list with the number of entries equal to the highest frequency from the first step. The index of each slot in this list will hold a list of the words with the frequency count equal to the index. So words that occur 3 times in the document will go in list[3] for example. Iterate through the hash map and insert the words into the appropriate spots in the list. This step is O(n) complexity.

Third step: Iterate through the list in reverse and output all the words. This step is O(n) complexity.

Overall this algorithm will accomplish your task in O(n) time rather than O(nlogn) required by quicksort.

MahlerFive 2009-06-05 04:14:25

First step O(n*m) where n is number of words in input, m is number of unique words. Second step uses O(m) memory and does so with a random access pattern - awful for cache. If the third step was used to feed into another peice of code it would need to have o(n) memory allocated. - All this means your code would have poor memory performance which will dominate any apparent code improvements.

Tom Leys 2009-06-05 04:38:05

If you used a really efficient hash the first step might only be O(m), if you are very lucky and there are no hash collisions.

Tom Leys 2009-06-05 04:39:38

I think you want to do something as explained in the below post:

http://karephul.blogspot.com/2008/12/groovy-closures.html

Languages which support closure make the solution much easy, like LINQ as Eric mentioned.

2009-06-05 18:31:30

In almost every case I've ever tested, Quicksort worked the best for me. However, I did have two cases where Combsort was the best. Could have been that combsort was better in those cases because the code was so small, or due to some quirk in how ordered the data was.

Any time sorting shows up in my profile, I try the major sorts. I've never had anything that topped both Quicksort and Combsort.

Nosredna 2009-06-05 18:38:37

For large sets you can use what is known as the "sort based indexing" in information retrieval, but for 50,000 words you can use the following:

read the entire file into a buffer.
parse the buffer and build a token vector with struct token { char *term, int termlen; } term is a pointer to the word in the buffer.
sort the table by term (lexicographical order).
set entrynum = 0, iterate through the term vector, when term is new, store it in a vector : struct { char *term; int frequency; } at index entrynum, set frequency to 1 and increment the entry number, otherwise increment frequency.
sort the vector by frequency in descending order.

bill 2009-06-13 09:02:50

ansaurus

tags:

views:

answers:

Most efficient sorting algorithm for a large set of numbers

related questions