String hashing with linear probing

views:

391

answers:

+1 Q:

String hashing with linear probing

Hello,

I am stuck trying to figure out how to do string hashing with linear probing.

Basically, the idea is to hash every string from a dictionary (90000 words), and retrieve anagrams of selected words.

Here's what I did:

created a hash table 2*90000 in size
using a simple hash function, I hash each word from the dictionary, get a value
check if that hash table index is empty, if it is, assign the value, if not, generate a new hash value.
after every word is in the hash table, and I perform a search
the search word will receive a hash value after the hash function, and it will be checked whether that value exists in the hash table or not.
if it exists, it will compare the string using permutations. if the match is true, it will output it. if not, it will keep looking using a new hash value.

problem is, the whole process is extremely slow... it indexes fine, but searching takes REALLY long time.

I am out of ideas on how to make this faster..

Thank you for your time reading this.

Are you attempting to create Anagrams of a given string? In that case, just create an anagram on getting the string as input. What's the point in hashing those strings?

EDIT: First of all I would suggest you to get all the permutations of the given string and then loop through the dictionary file having all the words. This has the advantage that you don't need to have all the words in memory. If you want to optimize further, sort the entire file based on the string lengths by ascending or descending order and keep checking for the strings in the dictionary file till you are <= to the next string's length.

Jagannath 2010-01-25 05:50:06

to search whether that anagram is a valid word or not

Neeraj 2010-01-25 05:53:29

The anagrams needs to be from the dictionary. Since there's 90000 words, it should be slower to traverse through them one by one using permutations..

tpae 2010-01-25 05:53:33

+3 A:

Put all the letters in alphabetical order first, then hash the result with any hashing algorithm you please (crc32, md5sum, sha1, count the vowels, anything... though counting the vowels will lead to a less-efficient solution), and store the word as a leaf node to that hash entry (in a linked list, obviously) -- do a mod(x) on the hash result to limit the buckets to 2^x.

Then, when you go to find an anagram, do the exact same "insert" procedure on your test word: alphabetize the letters, then run it through your same hash function. Then for each leaf node, compare the alphabetized letter list with the saved word's alphabetized list. Each match is an anagram.

(I normally don't like to give homework help, but this one was too tempting. Now I kind of want to go write a fun little program to find all the anagrams in a given dictionary.)

tylerl 2010-01-25 06:03:03

The question is about linear probing - he should store the word in the next open bucket, not in the (non-existent) leaf node. But your suggestion to sort the letters is better than adapting the hash algorithm itself.

Joe 2010-01-25 06:27:09

+1 for sorting the letters, for anagrams it really makes sense and you even get the benefit that all anagrams ends up in the same bucket.

Matthieu M. 2010-01-25 12:31:26

+1 A:

Linear Probing is used in the case when the hash function you are using gives collision for some input string.In that case you search sequentially the hash table until you find your search key.

This method has a disadvantage in that if there is one collision,there will be more.

One way is that you can use Separate Chaining and use balanced trees as buckets to improve lookup.

If you just want to improve performance, make sure that there are no collisions (in that case lookup is perfectly O(1)) and if there are, increase the hashtable size.

Neeraj 2010-01-25 06:09:40

Sorry, but homework requires me to use linear probing.. :(

tpae 2010-01-25 07:22:59

If you are searching each permute of your input word, this is the source of your problem. The number of permutations of the letters of a word can get quite large.

Instead, pick a hash function which is the same for any permutation (anagram) of a word. For example, one based on the sum of character codes of the characters in the word.

ergosys 2010-01-25 06:21:11

While this solves the immediate problem, it's an extremely poor algorithm for hashing - there will be a ton of collisions. A better solution is the one proposed by tylerl, which can use a good hashing algorithm and still cope with the large number of permutations.

Joe 2010-01-25 06:25:51

I didn't say what the hash function was, just that it was the same for any permutation. If you sort the characters first, this satisfies my criterion. Guilty of a bad example though.

ergosys 2010-01-25 06:30:10

This won't work, since not all words are same length.

tpae 2010-01-25 06:30:35

If they aren't the same length, how can they be anagrams? Or am I misunderstanding the problem?

ergosys 2010-01-25 06:47:38

basically, all the words in the dictionary are different lengths. Let's say the word I am searching is a "step", the word "pets" is in the dictionary. let's say, (in example, "step" is 1+2+3+4 = 10). but which is also the same case with 5+5=10, 3+3+3+1=10.. different lengths affect what type of hash value I get, more likely to get collisions.

tpae 2010-01-25 07:22:28

ansaurus

tags:

views:

answers:

String hashing with linear probing

related questions