hey guys,

please help me find optimized solutions to these interesting data structure questions:

  1. Given a file containing approximately 10 million words, design a data structure for finding anagrams.
  2. Write a program to display the ten most frequent words in a file, such that your program is efficient in all complexity measures.
  3. You have a file with millions of lines of data. Only two lines are identical; the rest are all unique. Each line is so long that it may not even fit in memory. What is the most efficient way to find the identical lines?

Adding some more questions:

4) (asked by MS) You are given an array of strings, each of length 3. One string in the array is marked as the start string and another as the end string. You have to convert the start string to the end string, with the condition that each intermediate string you form differs from the previous string by only one character and is present in the input array. E.g., if the input is Array: {"fat", "tab", "eat", "see", "tub", "fab", "rat", "sel"}, Start: "fat", End: "tub", then the output should be fat -> fab -> tab -> tub

I tried to solve the third one and came up with two possible approaches:

  1. Read only the first word of each line and eliminate every line whose first word does not match the first word of any other line. Keep reading the successive words of the remaining lines in this manner until you are left with just two lines. You have your answer!
  2. Convert each line into a smaller representation. This can be achieved by coding each word in a short binary form and then XORing the bits representing each line (see the sketch below).
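Here is a rough Java sketch of the second approach, with the XOR of word codes swapped for an ordinary rolling hash computed character by character, so no line is ever held in memory. The file name is a placeholder, and re-reading the two candidate lines would be needed to rule out a hash collision:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class DuplicateLines {
    private static final long SEED = 1125899906842597L;

    public static void main(String[] args) throws IOException {
        Map<Long, Integer> firstSeen = new HashMap<>(); // fingerprint -> line number
        long h = SEED;
        int lineNo = 1;
        try (BufferedReader in = new BufferedReader(new FileReader("huge.txt"))) {
            int c;
            while ((c = in.read()) != -1) {
                if (c == '\n') {                    // end of line: record its fingerprint
                    Integer prev = firstSeen.putIfAbsent(h, lineNo);
                    if (prev != null) {
                        System.out.println("Probable duplicates: lines " + prev + " and " + lineNo);
                    }
                    h = SEED;
                    lineNo++;
                } else {
                    h = 31 * h + c;                 // rolling hash; the line itself is never stored
                }
            }
        }
        // Note: a file that does not end with '\n' would need one extra check after the loop.
    }
}
```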

Edit: I now have a good collection of data-structure problems with me; if anyone is interested in discussing them here, I can post some more.

A: 

I think the answer to Q2 could be to use a HashMap or a trie to store each word and its frequency. Both of these structures provide fast lookups. For each word in the file, check whether the word already exists: if yes, increase its count; otherwise add the word and initialize its count to 1.
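A minimal sketch of the HashMap-counting idea, with the ten most frequent words then extracted by a size-10 min-heap (the trie variant would count the same way; the file name is a placeholder):

```java
import java.io.IOException;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Scanner;

public class TopTenWords {
    public static void main(String[] args) throws IOException {
        // Count every word: existing word -> increment, new word -> start at 1.
        Map<String, Integer> counts = new HashMap<>();
        try (Scanner sc = new Scanner(Paths.get("words.txt"))) {
            while (sc.hasNext()) {
                counts.merge(sc.next().toLowerCase(), 1, Integer::sum);
            }
        }

        // Min-heap holding only the ten best candidates seen so far.
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>(Map.Entry.comparingByValue());
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > 10) {
                heap.poll();                  // evict the least frequent of the eleven
            }
        }

        // Printed from least to most frequent of the top ten.
        while (!heap.isEmpty()) {
            Map.Entry<String, Integer> e = heap.poll();
            System.out.println(e.getKey() + " : " + e.getValue());
        }
    }
}
```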

One of my friends was recently asked this question in a Microsoft interview. The solution he proposed was similar: maintain a HashMap and a doubly linked list (DLL), with the DLL kept sorted by word frequency. Any time a new word from the document is added, the DLL is modified accordingly to reflect the current sorted order.

I think the most efficient solution is to use a trie together with a doubly linked list.

Ashish
Thank you for that! You're a genius!
Yehonatan
A: 

The solution I proposed for the 4th question was to make a graph of all the words in the array. First create an adjacency matrix which has the value 1 if one word can be directly converted to another (i.e., the two words differ in exactly one character).

Now use depth-first search to find a path from the start string to the end string and return the path if found.

Note that you will have to modify the depth-first search to keep track of the path traversed so far. For this, instead of pushing just the node being visited onto the stack, I push the complete path up to that node, so that when the end string is found, I already have the full path from the start string to the end string (see the sketch below).
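Here is a rough Java sketch of this approach, using the word list and start/end strings from the example in the question. The DFS stack holds complete paths of word indices rather than single nodes, as described above; a breadth-first search would instead return the shortest conversion.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class WordLadder {
    public static void main(String[] args) {
        String[] words = {"fat", "tab", "eat", "see", "tub", "fab", "rat", "sel"};
        System.out.println(findPath(words, "fat", "tub"));   // [fat, fab, tab, tub]
    }

    static List<String> findPath(String[] words, String start, String end) {
        int n = words.length;

        // Adjacency matrix: adj[i][j] is true when words i and j differ in exactly one character.
        boolean[][] adj = new boolean[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                adj[i][j] = adj[j][i] = differByOne(words[i], words[j]);
            }
        }

        // DFS where the stack holds complete paths (lists of word indices), not single nodes.
        int startIdx = Arrays.asList(words).indexOf(start);
        boolean[] visited = new boolean[n];
        visited[startIdx] = true;
        Deque<List<Integer>> stack = new ArrayDeque<>();
        List<Integer> first = new ArrayList<>();
        first.add(startIdx);
        stack.push(first);

        while (!stack.isEmpty()) {
            List<Integer> path = stack.pop();
            int last = path.get(path.size() - 1);
            if (words[last].equals(end)) {               // found: the popped path is the answer
                List<String> result = new ArrayList<>();
                for (int idx : path) result.add(words[idx]);
                return result;
            }
            for (int next = 0; next < n; next++) {
                if (adj[last][next] && !visited[next]) {
                    visited[next] = true;
                    List<Integer> extended = new ArrayList<>(path);
                    extended.add(next);
                    stack.push(extended);
                }
            }
        }
        return null;                                     // no conversion exists
    }

    static boolean differByOne(String a, String b) {
        if (a.length() != b.length()) return false;
        int diff = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i) && ++diff > 1) return false;
        }
        return diff == 1;
    }
}
```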

Ashish
No wait! A double genius!
Yehonatan