views:

139

answers:

2

Hi,

Today I attended a written test conducted by a company. The overall test was focussed on data structures. I got a problem which I thought I solved. But I had a tough time in calculating the Big O function for the data structure. I will provide the question and the answer I came up with.

Given a document you need to store and the words in the document and should be able to return the count when any word is entered. You are provided with char* GetNextWord().

  1. What Data structure will you choose
  2. Give the Algorithm
  3. What will be the order of your algorithm

For question 1, I wrote I will go for TRIE data structure. For question 2, I gave a brief algorithm. I wrote I would construct the TRIE data structure as following.

struct TRIE{
 boolean isWord;
 int count;
 Node* myList;
}

struct Node{
 char* character;
 Node *next;
 TRIE *child;
}

I have methods constructTrie() which will do a addToTrie() for each word.

I wrote the order of addToTrie() would be O(k) where k is the length. And the order of constructTrie() would be N*O(k) where N would be the number of words.

Now my question is this: Whether the orders I've mentioned is correct or not? If not, how to attack problems like this in the future (given a ds finding the order). I got really confused after using O(k). It makes me assume O(1).

Hints/Tips/Advise are wide open!!

Edit : Corrected the question clearly mentioning that the word count should be stored for all unique words.

+2  A: 
KennyTM
+1  A: 

If you really want to use a trie, then addToTrie() would indeed be O(k) where k is the length of the word you are adding. constructTrie() would take O(Nk) where N is the number of words, if you just call addToTrie() for every word. However, you don't need to call the addToTrie() function for every word. Once you have finished adding a word, just reset a trie pointer to the trie's root, then move the pointer as you are moving over your current word, adding the characters as you go along. Pseudocode:

trieNode *curr = trieRoot;
for each character c in document
  if it's a word terminator (space etc)
    add a character at curr signaling the end of the current word ('\0' maybe);
    curr = trieRoot;
  else if character is not a separator
    add character c at curr->next->character[c];
    curr = curr->next;

This will give you O(C) running time for constructing the trie, where C is the number of characters in your document.

Now, this begs the question: why do you need the trie at all? Obviously you figured out a way to detect when a word has ended, so why must you add your words to a trie? It's overkill. The only datastructure you need is a few variables: one to keep track of the current character, one to keep track of the previous character and one to count the words. This is easily done in O(C) like this:

char prev = '\0';
char curr;
int count = 0;

for each character curr
  if curr is a word separator and prev isn't 
    ++count;
  prev = curr;

I think it makes no sense to use a trie for this problem, it's only complicating things. I think if they wanted to test your knowledge of tries they would have given you a problem where a trie made more sense.

Even if they gave you a getNextWord() function (did you HAVE to use it? because you can do better without it), I'm guessing it returns "\0" or something when there are no more words? So why can't you just call it until it returns "\0" and count the words like that? Either way, a trie doesn't really make sense here.

IVlad
Either you got my question wrong or I was not explanative. Its not only counting the words. Its counting the unique words. After constructing the data structure, I should be able to give the word count for any word being entered and I strongly back the idea of using a TRIE here.Apologies if my question was vague. Let me correct it
Bragboy
I'm sorry, I thought you only needed to count the words. In that case, ignore the second part of my post. A trie is a good call. My first part still stands though - you can build your trie in O(C) (assuming you don't HAVE to use the getNextWord() function, if you do your solution is fine) and answer any query for a word of length k in O(k).
IVlad
Thanks for understanding. According to your solution, we are able to find it in O(C) time, but that is equal to O(Nk) right? And the reason I have to use that getNextWord() is, that is the only public method available to me. I dont have a pointer to the document. Probably, I should be clearer next time when posting questions :)
Bragboy
O(C) and O(Nk) are the same thing in theory, but in practice my O(C) solution would be faster. If you have a procedure that inserts the word retrieved by getNextWord() into your trie, then first getNextWord() will have to get the word, which is O(k) and then your insertion procedure will take O(k) to insert it into the trie. So basically you will traverse each word, and therefore each character, twice, while my solution would only check each character once. Anyway, what I said doesn't apply if you must use getNextWord(), so the running time of your algorithm is O(Nk) or O(C).
IVlad
Under the given constraints of the problem statement, your solution is optimal both in theory and in practice. O(Nk) == O(C) because Nk is basically C (N number of words, k average word length, multiply them and you get an average character count, meaning C). I used O(C) to emphasize the fact that the algorithm is different and a bit better in general.
IVlad
I'm confused. How the 2nd code help to generate the count of a word (not the total word count)?
KennyTM
@KennyTM: it doesn't, I misunderstood the question at first. Only the first code is relevant, I left the second part just in case it might help someone.
IVlad