views: 398
answers: 10

If you have 65,536 random English words, each 1 to 32 characters long, whose occurrences you need to count and then sort by dictionary order or by frequency rank, how would you structure the data, and what sorting technique would you use to process it the fastest?

+15  A: 

65,000 words is, in all seriousness, a minuscule sorting problem. Unless you have to re-sort it many times per minute, I would suggest you just use the qsort() built into the standard library. That's what it's for, after all.

I'd suggest using a simple array of char pointers for the alpha order. For maintaining the frequency order, you can use a structure such as:

typedef struct {
    char *word;      // points to one of the strings.
    int frequency;   // counts the number of occurrences.
} tFreq;

in another array, which you can populate fully whenever you create or modify the alpha-sorted array of pointers (see below for my reasoning why this seemingly inefficient process is suitable).

As an example of the speed, consider the following piece of code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define MAXWDS 66000

static char *words[MAXWDS];

static int compFn (const void *p1, const void *p2) {
    return strcmp (*((const char **)p1), *((const char **)p2));
}

int main (void) {
    char *pWord;
    int i, j, sz;
    time_t t0, t1;

    // Generate MAXWDS random uppercase "words" of 1 to 32 characters.
    srand (time (0));
    for (i = 0; i < MAXWDS; i++) {
        sz = rand() % 32 + 1;
        pWord = words[i] = malloc (sz + 1);
        for (j = 0; j < sz; j++)
            pWord[j] = 'A' + (rand() % 26);
        pWord[sz] = '\0';
    }

    t0 = time(0);
    qsort (words, MAXWDS, sizeof (char*), compFn);
    t1 = time(0);

    printf ("Time taken for %7d elements was %2d second(s)\n",
        MAXWDS, (int)(t1 - t0));
    return 0;
}

On a 3 GHz dual-core Intel chip, here are the outputs for a few select values of MAXWDS:

   MAXWDS   Output
---------   ------
   66,000   Time taken for   66000 elements was  0 second(s)
  100,000   Time taken for  100000 elements was  0 second(s)
  500,000   Time taken for  500000 elements was  0 second(s)
  600,000   Time taken for  600000 elements was  1 second(s)
1,000,000   Time taken for 1000000 elements was  1 second(s)
2,000,000   Time taken for 2000000 elements was  2 second(s)
3,000,000   Time taken for 3000000 elements was  5 second(s)
4,000,000   Time taken for 4000000 elements was  7 second(s)
5,000,000   Time taken for 5000000 elements was  9 second(s)
6,000,000   Time taken for 6000000 elements was 10 second(s)
7,000,000   Time taken for 7000000 elements was 11 second(s)
9,999,999   Time taken for 9999999 elements was 21 second(s)

So, as you can see, qsort is fairly efficient for the data set sizes you're talking about.

In fact, implementing the whole process for maintaining two sorted lists, as shown in the code below, shows you exactly how insignificant 66,000 elements are. The basic premise is to:

  • modify the alpha-sorted strings as needed, then do a full alpha sort of them (t0 to t1).
  • use the fact that they're sorted to easily transfer them to another array but with only one element per word, and the frequency (t1 to t2).
  • sort that frequency array as well (t2 to t3).

The following code shows how that's done. The only mildly tricky bit is the transfer from the alpha array to the frequency array.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define MAXWDS 66000

typedef struct {
    char *word;
    int frequency;
} tFreq;

static char *words[MAXWDS];
static tFreq freq[MAXWDS];
static int numFreq;

static int compFn (const void *p1, const void *p2) {
    return strcmp (*((const char **)p1), *((const char **)p2));
}

static int compFn2 (const void *p1, const void *p2) {
    // Descending order of frequency.
    return ((const tFreq*)p2)->frequency - ((const tFreq*)p1)->frequency;
}

int main (void) {
    char *pWord;
    int i, j, sz;
    time_t t0, t1, t2, t3;

    // Generate random words.
    srand (time (0));
    for (i = 0; i < MAXWDS; i++) {
        sz = rand() % 32 + 1;
        pWord = words[i] = malloc (sz + 1);
        for (j = 0; j < sz; j++)
            pWord[j] = 'A' + (rand() % 26);
        pWord[sz] = '\0';
    }

    t0 = time(0);

    // Alpha sort.
    qsort (words, MAXWDS, sizeof (char*), compFn);
    t1 = time(0);

    // Pre-condition to simplify the loop: seed with the first word
    // at zero frequency.
    freq[0].word = words[0];
    freq[0].frequency = 0;

    // Transfer to frequency array.
    for (i = numFreq = 0; i < MAXWDS; i++) {
        // If new word, add it and set frequency to 0.
        if (strcmp (freq[numFreq].word, words[i]) != 0) {
            numFreq++;
            freq[numFreq].word = words[i];
            freq[numFreq].frequency = 0;
        }

        // Increment frequency (for old and new words).
        freq[numFreq].frequency++;
    }
    numFreq++;
    t2 = time(0);

    // Sort frequency array.
    qsort (freq, numFreq, sizeof (tFreq), compFn2);
    t3 = time(0);

    // Output stats.
    printf ("Time taken for      sorting %5d elements was %d seconds.\n",
        MAXWDS, (int)(t1 - t0));
    printf ("Time taken for transferring %5d elements was %d seconds.\n",
        numFreq, (int)(t2 - t1));
    printf ("Time taken for      sorting %5d elements was %d seconds.\n",
        numFreq, (int)(t3 - t2));
    printf ("Time taken for   everything %5s          was %d seconds.\n\n",
        "", (int)(t3 - t0));
    for (i = 0; i < 28; i++) {
        printf ("[%-15s] %5d\n", freq[i].word, freq[i].frequency);
    }

    return 0;
}

And the output for 66,000 random strings is (the first few entries are shown so you can see that the sort worked):

Time taken for      sorting 66000 elements was 0 seconds.
Time taken for transferring 62422 elements was 0 seconds.
Time taken for      sorting 62422 elements was 0 seconds.
Time taken for   everything                was 0 seconds.

[Z              ]   105
[H              ]    97
[X              ]    95
[P              ]    90
[D              ]    89
[K              ]    87
[T              ]    86
[J              ]    85
[G              ]    84
[F              ]    83
[Q              ]    83
[W              ]    81
[V              ]    81
[M              ]    80
[I              ]    79
[O              ]    78
[A              ]    78
[B              ]    75
[U              ]    74
[N              ]    73
[C              ]    73
[S              ]    70
[Y              ]    68
[L              ]    65
[E              ]    60
[R              ]    59
[NQ             ]     8
[XD             ]     8

So, even if you do those operations every single time you insert or delete a value, they will have no discernible impact (unless, obviously, you're doing it more than once every couple of seconds, in which case you'd want to batch the changes for efficiency).
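
A tiny, purely hypothetical sketch of that batching idea, assuming the arrays and a re-sort routine like the above exist (resortBothArrays, insertWord and getAlphaOrder are stand-in names): mutations just set a flag, and the expensive work runs at most once per read.

// Hypothetical sketch: mutations only mark the data dirty, and the
// qsort/transfer/qsort sequence above runs at most once per read,
// however many inserts arrived in between.
extern char *words[];                 // the arrays from the code above
extern int numWords;
extern void resortBothArrays (void);  // wraps the qsort/transfer/qsort

static int dirty = 0;

void insertWord (char *word) {
    words[numWords++] = word;         // assumes capacity was checked
    dirty = 1;
}

char **getAlphaOrder (void) {
    if (dirty) {
        resortBothArrays ();
        dirty = 0;
    }
    return words;
}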

paxdiablo
+1  A: 

Bucket sort?
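
For what that could look like in this setting, a rough sketch, assuming (per the question) that words use only the letters 'A' to 'Z': distribute by first letter, then qsort each much smaller bucket. bucketSort and cmpStr are illustrative names, and the per-bucket allocation is deliberately naive.

#include <stdlib.h>
#include <string.h>

static int cmpStr (const void *a, const void *b) {
    return strcmp (*(const char **)a, *(const char **)b);
}

// Distribute the words into 26 buckets by first letter, sort each
// bucket, then copy them back in bucket order.
void bucketSort (char **words, int n) {
    char ***bucket = malloc (26 * sizeof (char **));
    int count[26] = {0};
    int i, b, out = 0;

    for (b = 0; b < 26; b++)
        bucket[b] = malloc (n * sizeof (char *));   // worst-case size
    for (i = 0; i < n; i++) {
        b = words[i][0] - 'A';
        bucket[b][count[b]++] = words[i];
    }
    for (b = 0; b < 26; b++) {
        qsort (bucket[b], count[b], sizeof (char *), cmpStr);
        for (i = 0; i < count[b]; i++)
            words[out++] = bucket[b][i];
        free (bucket[b]);
    }
    free (bucket);
}

With only 26 buckets the win over a plain qsort is marginal; a full bucket/radix sort would recurse on later characters.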

Dutow
A: 

Merge sort should work well for this and is pretty easy to get working in C.
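
For the record, a minimal sketch of that in C (msort and mergeSortWords are illustrative names), sorting the same kind of char* array as the code above, with caller-allocated scratch space:

#include <stdlib.h>
#include <string.h>

// Top-down merge sort of a[lo..hi) using tmp as scratch space.
static void msort (char **a, char **tmp, int lo, int hi) {
    int mid, i, l, r;

    if (hi - lo < 2)
        return;                       // 0 or 1 elements: already sorted
    mid = lo + (hi - lo) / 2;
    msort (a, tmp, lo, mid);
    msort (a, tmp, mid, hi);
    for (i = lo, l = lo, r = mid; i < hi; i++)   // merge the halves
        if (r >= hi || (l < mid && strcmp (a[l], a[r]) <= 0))
            tmp[i] = a[l++];
        else
            tmp[i] = a[r++];
    memcpy (a + lo, tmp + lo, (hi - lo) * sizeof (char *));
}

void mergeSortWords (char **words, int n) {
    char **tmp = malloc (n * sizeof (char *));
    msort (words, tmp, 0, n);
    free (tmp);
}

As a bonus it is stable, which qsort doesn't guarantee.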

DShook
+1  A: 

I would use whatever sorting algorithm my runtime library happens to provide. Typically, the library sort uses quicksort.

Don't worry about the choice of sort algorithm until you've measured your standard one and found that it isn't fast enough.

Roddy
A: 

You can use

struct elem { char *word; int frequency; };  // 'word' points to the string
struct elem dict[1 << 16];                   // 65,536 = the number of words

Use the standard qsort() to sort by word or by frequency, or use a second array if you need both orders at the same time.
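
A hedged sketch of that arrangement (byWord, byCount, byFreq and sortBoth are illustrative names): dict itself is sorted by word, and a second array of pointers into dict is sorted by frequency, so both orders coexist.

#include <stdlib.h>
#include <string.h>

struct elem { char *word; int frequency; };
static struct elem dict[1 << 16];
static struct elem *byFreq[1 << 16];  // pointers into dict
static int nWords;

static int byWord (const void *a, const void *b) {
    return strcmp (((const struct elem *)a)->word,
                   ((const struct elem *)b)->word);
}

static int byCount (const void *a, const void *b) {
    // Descending frequency; the elements here are pointers to elems.
    return (*(struct elem * const *)b)->frequency
         - (*(struct elem * const *)a)->frequency;
}

void sortBoth (void) {
    int i;

    qsort (dict, nWords, sizeof (struct elem), byWord);
    for (i = 0; i < nWords; i++)      // build only after dict has moved
        byFreq[i] = &dict[i];
    qsort (byFreq, nWords, sizeof (struct elem *), byCount);
}

Note that the pointer array has to be (re)built after sorting dict, since qsort moves the elements it points at.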

bill
+4  A: 

Check out http://www.sorting-algorithms.com/ for a nice visual representation of different sorting methods.

henrikpp
Great site, really clear.
Mark Pattison
I agree... great site.
Skurmedel
A: 

Choosing a sorting algorithm depends on the amount of data you have (65k isn't much) and the trade-off between time and memory you choose. If the data needs to be retrieved really fast, you will have to use more memory; on the other hand, you can't find the records as fast if you decide to save on memory.

The choice of algorithm is pretty easy: use whatever your language's library offers unless you have proof that it doesn't work well enough.

You need the data sorted by two criteria, so you actually need two sorted arrays. Both of them should be arrays of pointers of some kind.

marvesmith
A: 

It sounds as though you must sort it in two different ways:

  1. While reading the input file, when you don't yet know all the words in the input: binary search to test whether the word is already in the table, and a sorted insertion if it is not (using lexical order for both, as sketched below.)
  2. When the list and frequencies are complete, sort again by word frequency (using quicksort or possibly merge sort)
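
A minimal sketch of step 1 under those assumptions (entry and addWord are hypothetical names): binary search for the word's slot, bump the count on a hit, shift the tail and insert on a miss, so the table stays in lexical order while the input is read.

#include <string.h>

typedef struct { char *word; int frequency; } entry;

// Look up 'word' in the lexically sorted table of *n entries; count
// it if present, otherwise insert it at its sorted position.
// Assumes the table has room for one more entry.
void addWord (entry *table, int *n, char *word) {
    int lo = 0, hi = *n, mid, cmp;

    while (lo < hi) {                 // standard binary search
        mid = lo + (hi - lo) / 2;
        cmp = strcmp (word, table[mid].word);
        if (cmp == 0) {               // already known: just count it
            table[mid].frequency++;
            return;
        }
        if (cmp < 0) hi = mid; else lo = mid + 1;
    }
    // Not found: lo is the insertion point, so shift the tail up.
    memmove (table + lo + 1, table + lo, (*n - lo) * sizeof (entry));
    table[lo].word = word;
    table[lo].frequency = 1;
    (*n)++;
}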
finnw
A: 

Use a trie. That way, both "sorts" will be simple traversals of the graph.
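
A compact sketch of the idea, assuming 'A'-'Z' words as in the question (trieNode, insert and walk are illustrative names): children are indexed by letter, each node carries a count, and a depth-first walk emits the words in dictionary order.

#include <stdio.h>
#include <stdlib.h>

typedef struct trieNode {
    struct trieNode *child[26];
    int count;                        // > 0 marks the end of a word
} trieNode;

// Add one word, creating nodes (and the root) as needed.
static trieNode *insert (trieNode *root, const char *word) {
    trieNode *node;

    if (!root)
        root = calloc (1, sizeof (trieNode));
    node = root;
    for (; *word; word++) {
        int c = *word - 'A';
        if (!node->child[c])
            node->child[c] = calloc (1, sizeof (trieNode));
        node = node->child[c];
    }
    node->count++;
    return root;
}

// Depth-first walk: visiting children in 'A'..'Z' order prints every
// stored word, with its count, in dictionary order.
static void walk (const trieNode *node, char *buf, int depth) {
    int c;

    if (!node)
        return;
    if (node->count) {
        buf[depth] = '\0';
        printf ("%-32s %d\n", buf, node->count);
    }
    for (c = 0; c < 26; c++) {
        buf[depth] = 'A' + c;
        walk (node->child[c], buf, depth + 1);
    }
}

A caller builds it with root = insert (root, word) per input word, then prints with char buf[33]; walk (root, buf, 0). The frequency ordering would still need the (word, count) pairs collected into an array and sorted.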

Christoffer
+2  A: 

Oh dear, what a poorly worded homework question; the tutor should know better than this. The last part is "to process it the fastest". Unfortunately, it is very, very difficult to determine how long an algorithm will take to execute. Big O notation doesn't help, as it is a measure of complexity, not speed (for more information about this, see Raymond Chen's recent blog entry). The only practical way is to implement the algorithm, run it, and measure the time taken. The input data also affects the execution time: qsort and binary trees are non-optimal for data that is already sorted.
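
As an aside on the measuring itself, a minimal sketch using clock(), which resolves much finer than the whole-second time() used elsewhere on this page (sortUnderTest is a placeholder for whatever is being compared):

#include <stdio.h>
#include <time.h>

extern void sortUnderTest (void);     // the algorithm being measured

void measure (void) {
    clock_t start, end;

    start = clock ();
    sortUnderTest ();
    end = clock ();
    // clock() counts CPU time in ticks of CLOCKS_PER_SEC.
    printf ("CPU time: %.3f seconds\n",
        (double)(end - start) / CLOCKS_PER_SEC);
}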

You could write an entire paper on "fastest" and still not get a concrete answer.

Skizz