ansaurus

Question

Confusion in finding the order of the data structure

Answer 1

+2 A:

KennyTM 2010-02-27 07:47:16

Answer 2

+1 A:

If you really want to use a trie, then addToTrie() would indeed be O(k) where k is the length of the word you are adding. constructTrie() would take O(Nk) where N is the number of words, if you just call addToTrie() for every word. However, you don't need to call the addToTrie() function for every word. Once you have finished adding a word, just reset a trie pointer to the trie's root, then move the pointer as you are moving over your current word, adding the characters as you go along. Pseudocode:

trieNode *curr = trieRoot;
for each character c in document
  if it's a word terminator (space etc)
    add a character at curr signaling the end of the current word ('\0' maybe);
    curr = trieRoot;
  else if character is not a separator
    add character c at curr->next->character[c];
    curr = curr->next;

This will give you O(C) running time for constructing the trie, where C is the number of characters in your document.

Now, this begs the question: why do you need the trie at all? Obviously you figured out a way to detect when a word has ended, so why must you add your words to a trie? It's overkill. The only datastructure you need is a few variables: one to keep track of the current character, one to keep track of the previous character and one to count the words. This is easily done in O(C) like this:

char prev = '\0';
char curr;
int count = 0;

for each character curr
  if curr is a word separator and prev isn't 
    ++count;
  prev = curr;

I think it makes no sense to use a trie for this problem, it's only complicating things. I think if they wanted to test your knowledge of tries they would have given you a problem where a trie made more sense.

Even if they gave you a getNextWord() function (did you HAVE to use it? because you can do better without it), I'm guessing it returns "\0" or something when there are no more words? So why can't you just call it until it returns "\0" and count the words like that? Either way, a trie doesn't really make sense here.

IVlad 2010-02-27 08:51:20

Either you got my question wrong or I was not explanative. Its not only counting the words. Its counting the unique words. After constructing the data structure, I should be able to give the word count for any word being entered and I strongly back the idea of using a TRIE here.Apologies if my question was vague. Let me correct it

Bragboy 2010-02-27 10:52:31

I'm sorry, I thought you only needed to count the words. In that case, ignore the second part of my post. A trie is a good call. My first part still stands though - you can build your trie in O(C) (assuming you don't HAVE to use the getNextWord() function, if you do your solution is fine) and answer any query for a word of length k in O(k).

IVlad 2010-02-27 11:22:05

Thanks for understanding. According to your solution, we are able to find it in O(C) time, but that is equal to O(Nk) right? And the reason I have to use that getNextWord() is, that is the only public method available to me. I dont have a pointer to the document. Probably, I should be clearer next time when posting questions :)

Bragboy 2010-02-27 17:11:01

O(C) and O(Nk) are the same thing in theory, but in practice my O(C) solution would be faster. If you have a procedure that inserts the word retrieved by getNextWord() into your trie, then first getNextWord() will have to get the word, which is O(k) and then your insertion procedure will take O(k) to insert it into the trie. So basically you will traverse each word, and therefore each character, twice, while my solution would only check each character once. Anyway, what I said doesn't apply if you must use getNextWord(), so the running time of your algorithm is O(Nk) or O(C).

IVlad 2010-02-27 17:24:25

Under the given constraints of the problem statement, your solution is optimal both in theory and in practice. O(Nk) == O(C) because Nk is basically C (N number of words, k average word length, multiply them and you get an average character count, meaning C). I used O(C) to emphasize the fact that the algorithm is different and a bit better in general.

IVlad 2010-02-27 17:26:47

I'm confused. How the 2nd code help to generate the count of a word (not the total word count)?

KennyTM 2010-02-28 07:45:38

@KennyTM: it doesn't, I misunderstood the question at first. Only the first code is relevant, I left the second part just in case it might help someone.

IVlad 2010-02-28 08:29:12

ansaurus

tags:

views:

answers:

Confusion in finding the order of the data structure

related questions