tags:

views:

391

answers:

4

Hello,

I am stuck trying to figure out how to do string hashing with linear probing.

Basically, the idea is to hash every string from a dictionary (90000 words), and retrieve anagrams of selected words.

Here's what I did:

  1. created a hash table 2*90000 in size

  2. using a simple hash function, I hash each word from the dictionary, get a value

  3. check if that hash table index is empty, if it is, assign the value, if not, generate a new hash value.

  4. after every word is in the hash table, and I perform a search

  5. the search word will receive a hash value after the hash function, and it will be checked whether that value exists in the hash table or not.

  6. if it exists, it will compare the string using permutations. if the match is true, it will output it. if not, it will keep looking using a new hash value.

problem is, the whole process is extremely slow... it indexes fine, but searching takes REALLY long time.

I am out of ideas on how to make this faster..

Thank you for your time reading this.

A: 

Are you attempting to create Anagrams of a given string? In that case, just create an anagram on getting the string as input. What's the point in hashing those strings?

EDIT: First of all I would suggest you to get all the permutations of the given string and then loop through the dictionary file having all the words. This has the advantage that you don't need to have all the words in memory. If you want to optimize further, sort the entire file based on the string lengths by ascending or descending order and keep checking for the strings in the dictionary file till you are <= to the next string's length.

Jagannath
to search whether that anagram is a valid word or not
Neeraj
The anagrams needs to be from the dictionary. Since there's 90000 words, it should be slower to traverse through them one by one using permutations..
tpae
+3  A: 

Put all the letters in alphabetical order first, then hash the result with any hashing algorithm you please (crc32, md5sum, sha1, count the vowels, anything... though counting the vowels will lead to a less-efficient solution), and store the word as a leaf node to that hash entry (in a linked list, obviously) -- do a mod(x) on the hash result to limit the buckets to 2^x.

Then, when you go to find an anagram, do the exact same "insert" procedure on your test word: alphabetize the letters, then run it through your same hash function. Then for each leaf node, compare the alphabetized letter list with the saved word's alphabetized list. Each match is an anagram.

(I normally don't like to give homework help, but this one was too tempting. Now I kind of want to go write a fun little program to find all the anagrams in a given dictionary.)

tylerl
The question is about linear probing - he should store the word in the next open bucket, not in the (non-existent) leaf node. But your suggestion to sort the letters is better than adapting the hash algorithm itself.
Joe
+1 for sorting the letters, for anagrams it really makes sense and you even get the benefit that all anagrams ends up in the same bucket.
Matthieu M.
+1  A: 

Linear Probing is used in the case when the hash function you are using gives collision for some input string.In that case you search sequentially the hash table until you find your search key.

This method has a disadvantage in that if there is one collision,there will be more.

One way is that you can use Separate Chaining and use balanced trees as buckets to improve lookup.

If you just want to improve performance, make sure that there are no collisions (in that case lookup is perfectly O(1)) and if there are, increase the hashtable size.

Neeraj
Sorry, but homework requires me to use linear probing.. :(
tpae
A: 

If you are searching each permute of your input word, this is the source of your problem. The number of permutations of the letters of a word can get quite large.

Instead, pick a hash function which is the same for any permutation (anagram) of a word. For example, one based on the sum of character codes of the characters in the word.

ergosys
While this solves the immediate problem, it's an extremely poor algorithm for hashing - there will be a ton of collisions. A better solution is the one proposed by tylerl, which can use a good hashing algorithm and still cope with the large number of permutations.
Joe
I didn't say what the hash function was, just that it was the same for any permutation. If you sort the characters first, this satisfies my criterion. Guilty of a bad example though.
ergosys
This won't work, since not all words are same length.
tpae
If they aren't the same length, how can they be anagrams? Or am I misunderstanding the problem?
ergosys
basically, all the words in the dictionary are different lengths. Let's say the word I am searching is a "step", the word "pets" is in the dictionary. let's say, (in example, "step" is 1+2+3+4 = 10). but which is also the same case with 5+5=10, 3+3+3+1=10.. different lengths affect what type of hash value I get, more likely to get collisions.
tpae