views: 445
answers: 8

Hi All,

For an application I need to generate unique serial numbers for each English word.

What would be the best approach?

One constraint is that the serial number generation algorithm should be very efficient on an ordinary desktop computer.

Thanks

+3  A: 

Just use a 64-bit hash function, like Fowler-Noll-Vo. You're not likely to get collisions using a 64-bit integer, as this gives you 2^64 possible values, and there are certainly far fewer words than that in the English language. You'd need to normalize each word first, of course (convert to lower-case, etc.).
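As a minimal sketch of this suggestion, here is a 64-bit FNV-1a hash in Python, using the published FNV offset basis and prime (the normalization step is just lower-casing, as the answer suggests):

```python
FNV_OFFSET_64 = 0xcbf29ce484222325  # FNV-1a 64-bit offset basis
FNV_PRIME_64 = 0x100000001b3        # FNV-1a 64-bit prime

def fnv1a_64(word):
    """Hash a normalized (lower-cased) word to a 64-bit serial."""
    h = FNV_OFFSET_64
    for byte in word.lower().encode("utf-8"):
        h ^= byte
        h = (h * FNV_PRIME_64) & 0xFFFFFFFFFFFFFFFF  # stay within 64 bits
    return h
```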

Charles Salvia
Due to the birthday paradox you should expect collisions starting at around 2^32 (about 4 billion) hashed words, and that should still be OK.
jug
You could even run over your dictionary, test whether there are any collisions, and if so pick another hash. The problem with that is if you want to add new words in future, they might create collisions after it's too late to change the hash (this problem also applies when constructing a perfect hash).
Steve Jessop
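A dictionary-wide collision check like the one suggested here is cheap to run; a sketch, where `words` is your word list and `h` is whatever hash function you are testing (both placeholders):

```python
def find_collisions(words, h):
    """Group words by hash value and return only the groups that clash."""
    seen = {}
    for w in words:
        seen.setdefault(h(w), set()).add(w)
    return {k: v for k, v in seen.items() if len(v) > 1}
```

For example, `find_collisions(words, len)` deliberately collides all words of the same length, which makes the reporting easy to verify.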
+3  A: 

Do you really need it to be 'serial'? If not, have you tried the various hash algorithms? Several of them are built into .NET (MD5 and SHA1, if I remember correctly). I am not sure which one would be good enough, especially with short strings.

mfeingold
+5  A: 

You appear to be asking about a perfect hashing function. If so, take a look at this Wikipedia article, and at the gperf utility.

anon
I don't think a perfect hash function is necessarily a good idea here, as he'd have to rerun gperf every time a new word was added.
Charles Salvia
Who knows? He doesn't say how he is going to use the "serial number". Maybe he wants to encode a fixed size dictionary.
anon
+6  A: 

Do you have a list of all possible words? If yes, start from 0 at the first word and increment the serial by 1 for each word.

If not, then a simple way to guarantee they are unique is to use the word itself as the serial. For example, ABC = 0x41 0x42 0x43 = 4276803. As suggested in the comments, there are other ways (which however require more work), such as compressing the words first with, for example, Huffman coding.

This of course gets awkward with long words: The serial of Pneumonoultramicroscopicsilicovolcanoconiosis would require around 100 digits, for example.

Otherwise you can use a hash, but there is no guarantee it will be unique for all English words.
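The word-as-number idea from the second paragraph can be sketched with Python's `int.from_bytes` (the function names are illustrative); note the mapping is fully reversible:

```python
def word_to_serial(word):
    """Interpret the word's ASCII bytes as one big-endian integer."""
    return int.from_bytes(word.encode("ascii"), "big")

def serial_to_word(serial):
    """Invert word_to_serial: recover the word from its serial."""
    length = (serial.bit_length() + 7) // 8
    return serial.to_bytes(length, "big").decode("ascii")
```

For "ABC" this gives exactly the 4276803 from the answer, and long words simply produce proportionally large integers.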

Andreas Bonini
The first paragraph is the only answer so far that meets the 'serial' requirement.
Ewan Todd
I don't understand why this is being downrated. If he has a list of all possible words, my solution (an incremental serial) is much better than using a "perfect hash" (as the answer with 3 points suggested); if he doesn't, then as I said he can either accept a small risk of collisions or use the word itself as the serial.
Andreas Bonini
Just guessing, but perhaps the downvotes are about the too-strong claim in the second para, that "the only way" is using your ad-hoc algorithm. Clearly, that is not the only way.
Ewan Todd
@Ewan: While I agree that there is possibly another way of doing it, I can see no answer yet that proposes a method that produces a guaranteed unique key for each word. Using a good enough hashing function the risk should be negligible, but collisions can still occur, right?
kigurai
For instance, another way is to compress the string, then use the compressed result as a unique value in the same way you use the uncompressed string as a value. Finding a good compression scheme for such short strings takes a little work, but one way is to build a Huffman tree based on letter frequencies calculated from a longish list of English words.
Steve Jessop
I edited my answer to include other possible ways; thanks for the feedback.
Andreas Bonini
+4  A: 

Here is an algorithm (in python) that allows you to code and decode any combination of lowercase letters:

def encode(s):
  r = 1
  for c in s:
    r = r * 26 + (ord(c) - ord('a'))
  return r

Using 64 bits you can code up to 12-letter words. You can use the remaining unused serials as an index to a table containing low-frequency, very long words.
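Only the encoder is shown; a matching decoder for this scheme (the initial value of 1 acts as a length sentinel, so leading 'a's are preserved) could look like this:

```python
def decode(r):
  """Invert encode: peel off base-26 digits until only the sentinel 1 remains."""
  s = []
  while r > 1:
    r, d = divmod(r, 26)
    s.append(chr(d + ord('a')))
  return ''.join(reversed(s))
```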

Alexandru
A: 

What about the MD5 hash algorithm? Do something like this:

serialNumber = MD5( ToLower ( english word ) )
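In Python terms (the pseudocode above looks .NET-flavoured), that one-liner would be roughly:

```python
import hashlib

def md5_serial(word):
    """128-bit MD5 digest of the lower-cased word, as an integer serial."""
    return int.from_bytes(hashlib.md5(word.lower().encode("utf-8")).digest(), "big")
```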
Nick Berardi
+1  A: 

Are you looking for every word, or every word in the English dictionary? Are you using standard words, i.e. from the Oxford English Dictionary, or are slang words included too? I guess what I'm getting at is: "How big is your dictionary?"

You could use an MD5 hash, which has a theoretical possibility of collisions, albeit on the order of 1 in billions of hashes. That said, I can't say I understand the purpose of using a hash over using the actual word, unless perhaps you want to calculate the serial client-side so that it references the correct dictionary item on the server side without having to parse the dictionary looking for its serial. Of course, the word itself obviously has to be sufficiently unique in order for us to understand it as humans, and we're far more efficient at parsing the meaning of words than a computer is at doing the same.

Are you looking to separate words that look the same but are pronounced differently? Words that look and sound the same but have different meanings? If so, then you're going to come unstuck with a hash, as the same spelling with a different semantic will produce the same hash, so it won't work for this scenario. In this case you'd need some kind of incremental system. If you add words after the fact to the dictionary, will they be added at the end and just given the next serial number in sequence? What if that word is spelled the same as another word but sounds different or sounds the same but has a different semantic? What then?

I guess it depends on the purpose of the serialization as to what would be the most suitable output for your serial number and hence what would be the most efficient algorithm.

The most efficient algorithm would probably be to split your dictionary into the same number of chunks as you have processors, have a thread on each processor serialize the words in its chunk, and recombine the output from each thread at the end. This (in theory) would run in roughly n/(number of processors) time in the real world, although for mathematical correctness it's still O(n), because you still have to parse the whole dictionary once to serialize each word.

I think the safest way to go is:

  • Worry about what you've got now
  • Order them in the most logical sequence (alphabetically?)
  • Number them in sequence
  • Add new words (whether spelled the same or not and having different semantics) at the end; give them the next number in the sequence, regardless of their rightful place in the dictionary alphabetically.
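The steps above can be sketched as a simple append-only assignment (the function names are illustrative):

```python
def build_serials(words):
    """Assign serials 0, 1, 2, ... to an alphabetically ordered word list."""
    return {w: i for i, w in enumerate(sorted(set(words)))}

def add_word(serials, word):
    """New words get the next number, regardless of alphabetical position."""
    if word not in serials:
        serials[word] = len(serials)
    return serials[word]
```

Sorting happens only once, at build time; later additions never renumber existing words, which is exactly the point of the scheme.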

This way you don't have to worry about leaving gaps in the serial numbers to account for insertions between words, and you don't have to worry about reindexing any dependent data when words are inserted; you just carry on as normal. You don't have to worry about collisions, and you still get the most efficient indexing mechanism for storage purposes, meaning you're not storing MD5 hashes that are potentially longer than the original word, which makes no sense for real-world use.

If you need to access the dictionary alphabetically, just sort by the word, otherwise, don't.

I'm still at a loss as to the necessity of serializing the words at all, except for storage purposes, where you can key your dictionary and link tables by the word's serial.

BenAlabaster
A: 

I wonder if an answer is even possible.

Are color and colour the same word? Do they get one serial number or two?

Are polish and Polish the same word?

Are watch (noun) and watch (verb) the same word?

Are multiply (verb) and multiply (adverb) the same word?

Analysis (singular noun) and analyses (plural noun) are not the same word. Are analyse (plural verb) and analyze (plural verb) the same word? Are analyses (singular verb) and analyzes (singular verb) the same word? Are analyses (singular verb) and analyses (plural noun) the same word?

Are wont and won't the same word?

Are Beijing and Peking the same word? Or maybe they aren't English, since Londres and Frankreich aren't English, but then what is the English word for the capital of the Middle Country?

Windows programmer
The OP asked for a desktop indexer. Word context/semantic interpretation is usually left for the searcher to handle, and punctuation handling can be a configuration option. So yes, I would say it is possible: until such time as we have semantically interpreted search results, we have to live with compromise.
andora