views:

834

answers:

8

The code below will print me the highest frequency it can find in my hash table (of which is a bunch of linked lists) 10 times. I need my code to print the top 10 frequencies in my hash table. I do not know how to do this (code examples would be great, plain english logic/pseudocode is just as great).

  1. I create a temporary hashing list called 'tmp' which is pointing to my hash table 'hashtable'
  2. A while loop then goes through the list and looks for the highest frequency, which is an int 'tmp->freq'
  3. The loop will continue this process of duplicating the highest frequency it finds with the variable 'topfreq' until it reaches the end of the linked lists on the the hash table.

My 'node' is a struct comprising of the variables 'freq' (int) and 'word' (128 char). When the loop has nothing else to search for it prints these two values on screen.

The problem is, I can't wrap my head around figuring out how to find the next lowest number from the number I've just found (and this can include another node with the same freq value, so I have to check that the word is not the same too).

void toptenwords()
{
    int topfreq = 0;
    int minfreq = 0;
    char topword[SIZEOFWORD];

    for(int p = 0; p < 10; p++) // We need the top 10 frequencies... so we do this 10 times
    {
        for(int m = 0; m < HASHTABLESIZE; m++) // Go through the entire hast table
        {
            node* tmp;
            tmp = hashtable[m];

            while(tmp != NULL) // Walk through the entire linked list
            {
                if(tmp->freq > topfreq) // If the freqency on hand is larger that the one found, store...
                {
                    topfreq = tmp->freq;
                    strcpy(topword, tmp->word);
                }
                tmp = tmp->next;
            }
        }
        cout << topfreq << "\t" << topword << endl;
    }
}

Any and all help would be GREATLY appreciated :)

A: 

I would maintain a set of words already used and change the inner-most if condition to test for frequency greater than previous top frequency AND tmp->word not in list of words already used.

Alex B
A: 

When iterating over the hash table (and then over each linked list contained therein) keep a self balancing binary tree (std::set) as a "result" list. As you come across each frequency, insert it into the list, then truncate the list if it has more than 10 entries. When you finish, you'll have a set (sorted list) of the top ten frequencies, which you can manipulate as you desire.

There may be perform gains to be had by using sets instead of linked lists in the hash table itself, but you can work that out for yourself.

As one of the requirements I'm not able to use a binary tree. Thank you very much for your suggestion, it was very appreciated!
tmhai
@tmhai: One of the "requirements" is not being allowed to use a binary tree? Is this homework by any chance? (If so, fine, but tag your question as such.)
j_random_hacker
+1  A: 

Keep an array of 10 node pointers, and insert each node into the array, maintaining the array in sorted order. The eleventh node in the array is overwritten on each iteration and contains junk.

void toptenwords()
{
        int topfreq = 0;
        int minfreq = 0;
        node *topwords[11];
        int current_topwords = 0;

        for(int m = 0; m < HASHTABLESIZE; m++) // Go through the entire hast table
        {
                node* tmp;
                tmp = hashtable[m];

                while(tmp != NULL) // Walk through the entire linked list
                {
                        topwords[current_topwords] = tmp;
                        current_topwords++;
                        for(int i = current_topwords - 1; i > 0; i--)
                        {
                                if(topwords[i]->freq > topwords[i - 1]->freq)
                                {
                                        node *temp = topwords[i - 1];
                                        topwords[i - 1] = topwords[i];
                                        topwords[i] = temp;
                                }
                                else break;
                        }
                        if(current_topwords > 10) current_topwords = 10;
                        tmp = tmp->next;
                }
        }
}
Steve McKay
A: 

Step 1 (Inefficient):

Move the vector into a sorted container via insertion sort, but insert into a container (e.g. linkedlist or vector) of size 10, and drop any elements that fall off the bottom of the list.

Step 2 (Efficient):

Same as step 1, but keep track of the size of the item at the bottom of the list, and skip the insertion step entirely if the current item is too small.

Brian
A: 

A hash table containing linked lists of words seems like a peculiar data structure to use if the goal is to accumulate are word frequencies.

Nonetheless, the efficient way to get the ten highest frequency nodes is to insert each into a priority queue/heap, such as the Fibonacci heap, which has O(1) insertion time and O(n) deletion time. Assuming that iteration over the hash table table is fast, this method has a runtime which is O(n×O(1) + 10×O(n)) ≡ O(n).

+1. Two things: Deletion time for all heaps that I know of (including Fibonacci) is O(log n), not O(n); and if n >> 10 it will be much faster to keep a min-heap (rather than a max-heap), to which you insert the next element and remove the smallest on each iteration (so that the heap only ever contains at most 10 elements).
j_random_hacker
O(log(n)) ≡ O(n) is its amortized deletion time - since this algorithm involves only deleting some of the keys an amortized analysis isn’t valid. The Fibonacci heap’s worst case deletion is O(n).
Good point. So in the general case, when we want the top k frequencies, your algorithm will be O(kn) -- i.e. the same as the obvious solution of running the first k iterations of a selection sort. See my answer for an O(nlog(k)) algo that uses a plain binary heap (rather than a Fibonacci heap, which is trickier to implement).
j_random_hacker
A: 

Suppose there are n words in total, and we need the most-frequent k words (here, k = 10).

If n is much larger than k, the most efficient way I know of is to maintain a min-heap (i.e. the top element has the minimum frequency of all elements in the heap). On each iteration, you insert the next frequency into the heap, and if the heap now contains k+1 elements, you remove the smallest. This way, the heap is maintained at a size of k elements throughout, containing at any time the k highest-frequency elements seen so far. At the end of processing, read out the k highest-frequency elements in increasing order.

Time complexity: For each of n words, we do two things: insert into a heap of size at most k, and remove the minimum element. Each operation costs O(log k) time, so the entire loop takes O(nlog k) time. Finally, we read out the k elements from a heap of size at most k, taking O(klog k) time, for a total time of O((n+k)log k). Since we know that k < n, O(klog k) is at worst O(nlog k), so this can be simplified to just O(nlog k).

j_random_hacker
A: 

I eventually figured it out...

void toptenwords()
{
 int topfreq = 0;
 char topword[SIZEOFWORD];
 int counter = 0;

 cout << "\nTop Ten Words" << endl;

 for(int m = 0; m < HASHTABLESIZE; m++) // We need to find the highest frequency first...
 {
  node* tmp;
  tmp = hashtable[m];

  while(tmp != NULL) // Walk through the entire linked list
  {
   if(tmp->freq > topfreq) // If the freqency on hand is larger that the one found, store...
   {
    topfreq = tmp->freq;
    strcpy(topword, tmp->word);
   }
   tmp = tmp->next;
  }
 }

 while(topfreq > 0 && counter < 10) // This will now find the top 10 frequencies
 {  
  for(int m = 0; m < HASHTABLESIZE; m++) // Go through the entire hast table
  {
   node* tmp;
   tmp = hashtable[m];

   while(tmp != NULL) // Walk through the entire linked list
   {
    if(tmp->freq == topfreq) // If the freqency we're on is equal to the frequency we're keeping count of...
    {
     counter++;
     if(counter > 10) // We only want the top 10 words
      break;
     topfreq = tmp->freq; // Store the node details...
     strcpy(topword, tmp->word);
     cout << topfreq << "\t" << topword << endl;
    }
    tmp = tmp->next;
   }
  }

  topfreq--; // If counter is never incremented again, this will surely kill the loop... eventually.
 }
}

It takes about 30-60 seconds to execute, but it does the job. I don't know much about efficiency (obviously noted from the amount of time it takes).

tmhai
Of *all* the solutions posted to date, this is the slowest way to do it (unless you expect the top 10 frequencies will always contain some duplicate frequencies). If the top frequency is 1,000 and all other words appear with frequency 1, you will go through the entire hashtable 999 times for nothing. O(tn) where t is the highest frequency. -1.
j_random_hacker
You're totally right, I accept total responsibility for the lack of efficiency in this code. However, I needed a solution and I found it. I appreciate everyone's response to my initial question. Thank you.
tmhai
...ah, also forgot to mention. There very well could be duplicates in frequencies. Wouldn't let it pass you when testing for every scenario ;)
tmhai
Every solution posted, including yours, works *correctly* when duplicate frequencies are present -- what I meant is that yours is *faster* than some others when there are duplicates (and slower otherwise). E.g. if there are 10 top-equal frequencies then your algorithm only makes 2 passes through the hashtable -- it's actually the fastest of all in this case :) BTW if you appreciate a solution on this website, you can always give it a +1.
j_random_hacker
A: 

The absolute fastest way to do this would be to use a SoftHeap. Using a SoftHeap, you can find the top 10 items in O(n) time whereas every other solution posted here would take O(n lg n) time.

http://en.wikipedia.org/wiki/Soft_heap

This wikipedia article shows how to find the median in O(n) time using a softheap, and the top 10 is simply a subset of the median problem. You could then sort the items that were in the top 10 if you needed them in order, and since you're always at most sorting 10 items, it's still O(n) time.

LPCRoy